

Measurement 


in 
Today's Schools 


THIRD EDITION 


by 
C. C. Ross




Revised by 
Julian C. Stanley 


ASSOCIATE PROFESSOR OF EDUCATION 
UNIVERSITY OF WISCONSIN 




PRENTICE-HALL, INC. 




Copyright, 1941, 1947, 1954, by
PRENTICE-HALL, INC.
70 Fifth Avenue, New York
All rights reserved. No part of this book may be reproduced in any form,
by mimeograph or any other means, without permission in writing from the
publishers.


L.C. Cat. Card No. : 54-8846 


First printing . . . . . May, 1954
Second printing . . . . February, 1955


PRINTED IN THE UNITED STATES OF AMERICA 


Preface to the Third Edition 


In this third edition of Professor Ross's book, the general framework of 
the earlier editions has been retained, but deletions, insertions, or other 
changes have been made on nearly every page to modernize the content 
and to increase readability. 

“The Statistical Analysis of Test Results," Chapter 8 in previous edi- 
tions, has been rewritten almost completely and moved forward to become 
Chapter 3 and thus furnish needed background early in the book. Fifty 
five-option multiple-choice instructional items in Appendix A, with an- 
swers and explanations in Appendix E, and the review of square root com- 
putation in Appendix D were added to round out this material. 

The old Chapter 12, “Practice,” has been absorbed by Chapter 11, now 
called “Motivation and Practice as Related to Testing." The chapter on 
"School Marks" was dropped, since it overlapped other portions of the 
book and is now outdated. 

Chapter 17, “Some Present Trends,” and all six appendices are completely
new. Appendix B, “A Simplified Item-Analysis Procedure," is based 
upon hitherto unpublished studies by the writer. It sets forth, perhaps for 
the first time in an American elementary measurements book, a simple, 
illustrated, complete technique for determining the discriminating power 
and difficulty of each question in a test and several characteristics of the 
test scores. Appendix C, “Scoring Rearrangement (Ranking) Test Items," 
contains a table with values preferable to those commonly used for this 
type of item. Appendix F, “Publishers of Standardized Tests," replaces a 
shorter list formerly placed at the end of the final chapter. 

Three other additions should make the book easier to use: chapter sub- 
headings in the Table of Contents, a List of Tables, and an Author Index. 

I am indebted to a number of persons for assistance during the course of 
this revision. Mr. Gordon D. Mock checked many bibliographic entries, 
Professor Chester W. Harris supplied excellent comments concerning the 
items in Appendix A, Professor Robert L. Ebel made several helpful sug- 
gestions with reference to Appendices B and C, and Mrs. Margaret T. 
Aldridge and Professor Eric F. Gardner reviewed Chapter 3. 

In particular, I am grateful to my wife, Rose S. Stanley, for her pains- 
taking secretarial work and for otherwise expediting the revision. 


JULIAN C. STANLEY 


Preface to the Second Edition 


Since publication of the first edition of this book, experimentation in the 
field of measurement has made considerable progress. In this revision, the 
author has taken advantage of developments in this field, revising almost 
all the material from the first edition and including a great deal of alto- 
gether new material. All bibliographies and citations have been revised; a 
list of leading publishers of tests has been added to the final chapter. 

Chapter tests and exercises, incorporated directly into each chapter of 
the first edition, have now been compiled into a separate workbook. This 
arrangement is designed to save valuable time for the student working on 
the exercises and for the teacher correcting them. 

Sincere appreciation is due Mrs. Billy Whitlow Smith for considerable 
work both in the preparation of the manuscript and in correcting proof. 
Her assistance has been invaluable at every stage of the book’s progress. 


Preface to the First Edition 


It is doubtless true that more progress in measurement has been made 
during the past quarter of a century than during all the years preceding. 
But the pattern of the measurement books has remained much the same. 
They have been very definitely centered about subject matter. The treat- 
ment has usually been organized around the conventional school subjects, 
and much space has been devoted to lists and descriptions of the measuring 
instruments available. 

Authors of these texts have encountered obstacles increasingly difficult 
to surmount. The rapid increase in the number of tests and scales published 
has made it impossible to keep the books either complete or up to date. 
Even the most carefully compiled list of selected tests was likely to be ren- 
dered obsolete by the publication of better tests before the book was off the 
press. Fortunately, in recent years the appearance of rather complete and 
frequently revised bibliographies of published tests, together with critical 
evaluations, has made detailed lists and descriptions of available measuring 
instruments in textbooks no longer necessary. 

Meanwhile, instructors in measurement have manifested a growing dis- 
satisfaction with existing texts on the subject. For example, the typical 
class in measurement for high-school teachers has consisted of persons rep- 
resenting a variety of fields, but no one person has been interested in more 
than two or three of those discussed in the textbook, the rest of the material 
being largely deadwood. At the same time the enormous expansion of the 
experimental literature relating to measurement has had to be considered 
in any course that is at all adequate. And here the average book has left 
much to be desired. 

Fifteen years' experience in teaching educational measurement to college 
classes has led the author to attempt a functional approach to the subject. 
The present work is the outgrowth of this experience. The emphasis is, 
therefore, not so much upon the description of the tools themselves as upon 
the multitude of problems relating to their intelligent use and interpretation 
by classroom teachers and school administrators. 

It appears to the author that the time has come for a critical appraisal 
of measurement in today's schools, and for a careful search for
generalizations to guide both theory and practice. The experimental
evidence supporting these generalizations has been examined, and wherever
possible
reported in the language of the original author. 

Since the functions of measurement are much the same on all educational 
levels, the illustrations have been drawn from both the elementary school 
and the secondary school, and to some extent from college. It is hoped that 
the book will be found useful to teachers, and to prospective teachers, 
regardless of the subject or the level of instruction. 

In the preparation of the book the author has incurred obligations that 
are numerous and great. His first major indebtedness has been to his former 
teachers, notably Professors Edward L. Thorndike and William A. McCall, 
of Teachers College, Columbia University. The heavy obligation which the 
author owes to his co-workers in the field of measurement, upon whose 
publications he has freely drawn, is indicated by the numerous citations 
throughout the book. The fullest co-operation of these authors and their 
publishers is gratefully acknowledged. Special thanks are due to Professor 
A. B. Crawford, who has used a preliminary edition of the book at Tran- 
sylvania College and at the University of Kentucky, and who has made 
numerous constructive suggestions; and to Professor G. M. Ruch, of the 
United States Office of Education, who has read the manuscript, and whose 
pertinent criticisms have been invaluable. Finally, the author is indebted 
to his own students, who for three years have used a preliminary edition of 
the book and who have offered many suggestions that have contributed 
greatly to its improvement. 


C. C. Ross 


Table of Contents 


Part I
THE PROBLEM OF MEASUREMENT

CHAPTER

1. MEASUREMENT IN THE MODERN WORLD
   Measurement in science. Measurement in education.

2. THE HISTORICAL DEVELOPMENT OF MEASUREMENT IN EDUCATION
   Introduction. The history of [illegible]. The history of achievement
   tests. The history of character, personality, and interest
   measurement. Some important publications. Some relatively recent
   tendencies.

3. THE STATISTICAL ANALYSIS OF TEST RESULTS
   General considerations. Classification and tabulation. Some elementary
   notions concerning quantitative data. Finding the mode, the median,
   and the mean. Measures of variability or scatter. Measures of
   relationship. Measures of error. Summary. Instructional test items.

4. THE CHARACTERISTICS OF A SATISFACTORY MEASURING INSTRUMENT
   Introduction. Validity. Reliability. Usability. Some generalizations
   regarding the problem of measurement.


Part II
THE CONSTRUCTION OF TEACHER-MADE TESTS

5. GENERAL PRINCIPLES OF TEST CONSTRUCTION
   Planning the test. Preparing the test. Trying out the test.
   Evaluating the test.

6. PRINCIPLES OF CONSTRUCTING SPECIFIC TYPES OF OBJECTIVE TESTS
   Introduction. Simple-recall tests. [illegible]. Alternative-response
   tests. Multiple-choice tests. Matching tests. Rearrangement tests.

7. THE CONSTRUCTION AND USE OF ESSAY EXAMINATIONS
   Limitations of the essay examination. Advantages of the essay
   examination. Suggestions for improving essay examinations.


Part III
THE TESTING PROGRAM

8. STEPS IN THE TESTING PROGRAM
   Determining the purpose of the program. Selecting the appropriate
   test or tests. Administering the tests. Scoring the tests. Analyzing
   and interpreting the scores. Applying the results. Retesting to
   determine the success of the program. Making suitable records and
   reports.

9. THE GRAPHICAL REPRESENTATION OF EDUCATIONAL DATA
   The value of graphs. Representing the record of an individual.
   Representing a frequency distribution. Representing two or more
   distributions. General suggestions for constructing graphs.

10. THE USES AND LIMITATIONS OF NORMS
   Norms and standards. Raw scores and derived scores. The use of norms
   in interpreting scores on intelligence tests. The use of norms in
   interpreting scores on achievement tests. Methods of comparing
   intelligence and achievement. The use of norms in interpreting scores
   on personality tests.


Part IV
MEASUREMENT IN INSTRUCTION

11. MOTIVATION AND PRACTICE AS RELATED TO TESTING
   The problem of motivation. The relation of measurement to motivation
   in teaching. The relation of measurement to motivation in learning.
   Some educational implications of motivation studies. Practice effect.

12. DIAGNOSIS
   The problem of diagnosis in education. The techniques of diagnosis.

13. CLASSIFICATION AND PROMOTION
   The nature and educational significance of human variability. The
   activity movement. Homogeneous or ability groups.

14. EVALUATION IN GUIDANCE
   The meaning and importance of guidance. The place of measurement in
   guidance. Guidance is a co-operative venture.

15. EVALUATION OF SCHOOLS
   The problem of evaluation. General principles of evaluation.
   Evaluating various aspects of the school.

16. PUBLIC RELATIONS
   The problem. Ordinary agencies of public [illegible]. Official
   publications. Report cards and letters to parents. Other avenues of
   public information. Mobilizing public opinion.

17. SOME PRESENT TRENDS


APPENDICES

APPENDIX

A. FIFTY QUESTIONS TO HELP YOU LEARN STATISTICS

B. A SIMPLIFIED ITEM-ANALYSIS PROCEDURE
   Preparing the items. A measure of discrimination. A measure of
   difficulty. An illustrative analysis. A discrimination table.
   Obtaining the mean and the standard deviation. A simplified procedure
   for obtaining a reliability coefficient.

C. SCORING REARRANGEMENT (RANKING) TEST ITEMS

D. THE COMPUTATION OF SQUARE ROOTS

E. ANSWERS TO QUESTIONS IN APPENDIX A

F. PUBLISHERS OF STANDARDIZED TESTS

AUTHOR INDEX

SUBJECT INDEX


List of Tables 


TABLE

1. The Estimated Grade-Value and Percentage Marks Assigned to an English
   Composition by One Hundred Teachers
2. Percentage Values Assigned to Ten Essay Examination Papers by
   Twenty-[illegible]
3. A Class Record for a Reading Readiness Test
4. Reading Readiness Scores from Table 3 Arranged in Order of Size and
   Rank Order and Tabulated
5. An Illustration of the Process of Making a Grouped Frequency
   Distribution
6. Distribution of Reading Readiness Scores for Six Schools in a Certain
   City
7. The Chronological, Educational, and Mental Ages of the 20 Pupils in an
   [illegible]
8. A Two-Way Distribution of Mental Age and Educational Age for an
   Eighth-[illegible]
9. An Extremely Simple Scoring Method
10. Five Consecutive Exploratory Steps in Assigning "Overall" Scores to 31
   English Themes
11. Dividing the Four D's from the 11-Category Distribution of Table 10
   into Three Parts
12. The Process of Locating the Median
13. A Short Way to Compute the Mean of [illegible]
14. The Process of Computing the Quartile Deviation, Q
15. A Simplified Way to Compute the Standard Deviation
16. A Scatter Diagram Illustrating Negative Correlation Between
   Chronological Age and Educational Age for 20 Eighth Graders
17. Pretest and Midterm Scores of 43 Graduate Students on Two
   Teacher-Made Objective Tests in Intermediate Statistics
18. Computation of the r Between Scores on Two Teacher-Made Tests
19. The Computation of Rho for Each of Two Students Who Arranged Six
   Historical Events in Chronological Order
20. The Various Values of Rho (ρ) for All Possible Sums of Squared
   Deviations (ΣD²) for N's from 2 through 10
21. Estimating the Coefficient of Correlation by the Spearman
   Rank-Difference [illegible]
22. A Simple Expectancy Table, Based upon the 43 Pairs of Scores in the
   Table 18 Scattergram, for Which r = .53
23. Effects of Constant and Variable Errors on Certain Types of Statistics
24. Intercorrelations of Intelligence Test Scores and Five-Semester
   Average Grades for 284 Seniors (124 Boys, 160 Girls) in a Los Angeles
   High School
25. Rankings of Test Items According to Frequency of Use as Revealed by
   Two [illegible]
26. Plan for a Testing Program for the Elementary School
27. Advantages and Limitations of Standardized and Nonstandardized Tests
   of Achievement
28. Classification of Tests in The Fourth Mental Measurements Yearbook
29. Otis Scale for Rating Standard Tests
30. Cole-Von Borgersrode Scale for Rating Standardized Tests
31. A Table for Computing Months Since Last Birthday
32. Table for Equating Intelligence Quotient Values
33. Point Standing for the First and Second Semesters for Low-Ranking
   Freshmen Who Were Told Their Intelligence Test Scores as Compared with
   Those Who Were Not
34. Distribution of Spelling Difficulties and Successful Remedies
35. Frequencies with Which Various Provisions for Individual Differences
   Were Reported in Use, or in Use with Unusual Success
36. Trends Toward Greater Provisions for Individual Differences in
   Elementary Schools
37. The Main Methods of Evaluation Used by the Co-operative Study of
   Secondary School Standards, with the Weight Assigned to Each
38. Composition of 1940 Edition of the Alpha, Beta, and Gamma Scales for
   Evaluating Secondary Schools
39. Use of Tests in Evaluating Schools
40. Summary of the Mort-Cornell Score Sheet for the Self-Appraisal of
   School Systems
41. The Rank Orders of Thirteen Topics of School News According to the
   Interests of 5,067 School Patrons Compared with the Space Devoted to
   These Topics by Ten Newspapers
42. The 100 Items in a Five-Option Multiple-Choice Teacher-Made Test
   Arranged According to Discriminating Power
43. Number of Examinees in High and Low Groups Who Chose Each Option of
   Item No. 380
44. Number of Examinees in High and Low Groups Who Chose Each Option of
   Item No. 90
45. Number of Examinees in High and Low Groups Who Chose Each Option of
   the 25 Least Discriminating Items
46. Table for Determining Whether or Not a Given Test Item Discriminates
   Significantly Between a “High” and a “Low” Group
47. Formulas for Finding (Wz + Wz) Values at Three Difficulty Levels
48. ΣD² Table for Scoring Rearrangement Items



List of Figures 


FIGURE

1. [illegible]
2. [illegible]
3. A Scale for Measuring Pupils’ Attitudes Toward High School
4. The Relative Amount of Relationship Represented by r's of Various Sizes
5. IBM General-Purpose Answer Sheet
6. An Illustration of the Procedure Followed in Scoring Test 3 of the
   Terman Group Test of Mental Ability, Form A
7. A Sample Standard Test Scoring Record
8. An Educational Profile for a Standardized Achievement Test
9. Test Data Summary from the Cumulative Guidance Record of the
   Department of Supervision and Curriculum Development of the National
   Education Association
10. A Cumulative Record in Graphical Form
11. Centile Sheet for College Men and Women on Allport-Lindzey-Vernon
   Study of Values
12. Back to School, United States, 1900-1953
13. A Rather Complex Bar Graph with High Attention Value
14. Motor Buses in Operation in the United States: Fifteen Different
   Charts Based upon the Same Data
15. Profile of a Pupil and the Sixth-Grade Class of Which He Is a Member
16. The Profile of a Tenth-Grade Pupil on the California Achievement Test
17. Profiles for a Student Tested in the Fifth and Sixth Grades
18. A Histogram, or Column Diagram, Representing the Percentage Values
   Assigned to an Arithmetic Paper by Forty-Two Scorers
19. A Histogram, or Column Diagram, Representing the Distribution of IQ's
   in a Small Junior High School
20. A Frequency Polygon Representing the Percentage Values Assigned to an
   Arithmetic Paper by Forty-Two Scorers
21. An Actual Curve Compared with the Theoretical Curve of Probability
22. A Percentile Curve Representing the Percentage Values Assigned to an
   Arithmetic Paper by Forty-Two Scorers
23. A Percentile Curve Representing the Distribution of 83 IQ's in a
   Small Junior High School
24. Negative and Positive Skewness
25. Bar Graph Made on the Typewriter, Showing the Distribution of 91 IQ's
   in [illegible] School
26. Bar Graph Made on the Typewriter, Showing the Percentage of Pupils of
   Each Age Group Who Were Graduated from High School and the Percentage
   Who Entered High School but Did Not Graduate
27. Graph Made on the Typewriter, Showing the Overlapping of Grades
   Seven, Eight, and Nine in Reading Comprehension
28. Frequency Polygons Representing the Distribution of Reading
   Comprehension Scores on the Iowa Silent Reading Tests for the Seventh,
   Eighth, and Ninth Grades of a Certain School
29. Total Comprehension Scores on the Iowa Silent Reading Tests for the
   Seventh, Eighth, and Ninth Grades
30. The Learning of Three Groups Compared, One with Full Knowledge of
   Progress, One with Partial Knowledge of Progress, and One with No
   Knowledge of Progress
31. Correct and Incorrect Location of the Norms in a Line Chart, Showing
   Median Scores on a Reading Test
32. Grade Profiles for the Seventh, Eighth, and Ninth Grades of a Certain
   Junior High School Made by Connecting the Median Scores on Each Part
   of the Stanford Achievement Test, Advanced Complete Battery, Form J
33. A Line Graph Showing the Medians and Quartiles for Grades Four to
   Nine, Inclusive, in Reading Comprehension
34. The Central Tendency and Variability in Educational Age of Grades 2B
   to 9A, Inclusive, in a Small City School System
35. The Relation Between Standard Scores, Percentile Ranks, and Revised
   Stanford-Binet IQ's
36. The Profiles of Two Pupils Who Made the Same Total Score on a General
   Achievement Test
37. A Profile Based Upon Local Norms
38. The Influence of Knowledge of Progress upon Achievement in a College
   Class
39. A Study of the Influence of Praise and [illegible] Achievement in
   Fourth-Grade and Sixth-Grade Arithmetic
40. The Five Levels of Educational Diagnosis
41. Analysis Sheet of Test 3, Metropolitan Achievement Tests, Form A,
   Arithmetic Fundamentals, for a Fifth-Grade Class in October
42. Traxler Chart of Suggested Diagnostic and Remedial Procedures in
   Handwriting
43. General Quality of 200 Secondary Schools as Judged by Field Committees
44. Distribution of Mean Scores of Seniors in Forty-Nine Colleges in
   Pennsylvania on a Test of General Academic Knowledge
45. Distributions of Composite IQ's on Forms L and M of the Revised
   Stanford-Binet Intelligence Scales for a Standardization Group of
   2,904 Individuals of CA's 2 to 18 Years
46. A Suggested Technique for Evaluating the Philosophy of a Secondary
   School
47. Instructions for Using the Evaluative Criteria Developed by the
   Co-operative Study of Secondary School Standards
48. Summary of Evaluative Criteria for the Median Secondary School
49. An Evaluative Procedure for the Content of the Offerings in the
   Principal Subject-Matter Fields of a Secondary School
50. The Computation of Three Measures of the Adequacy of the Book
   Collections in the Library of a Secondary School
51. Evaluative Techniques for the Library Service of a Secondary School
52. The Computation of the Summary Score for the Guidance Service of a
   Secondary School
53. An Evaluative Technique for the Quality of Instruction in a Secondary
   School
54. A Suggested Informal Report to Parents
55. A Report Card Used at the University of Chicago High School




PART I 


The Problem 
of 


Measurement 


1 


Measurement in the Modern World 


From birth to death almost every aspect of our daily lives is touched by 
measurement in its numerous forms. At birth the record of that important 
event is carefully made according to the nurse's watch. During the next 
few days measurements of the baby’s weight and temperature are part of 
the daily routine of the hospital. Ever afterward, whether in school or out- 
side, watches, clocks, balances, thermometers, money systems, and other 
forms of measurement play prominent roles in the life of every human 
being. 

The daily round of the typical American probably begins somewhat like 
this: He rises at a certain hour by the clock, bathes in water measured 
by the meter, and dresses in clothing of a standard size. He begins his 
breakfast with half a grapefruit sold by the dozen, and sweetened with a 
spoonful of sugar sold by the pound. He continues with a bowl of cereal 
and two cups of coffee, both generously mixed with cream or milk sold 
by the quart. He then looks at his watch, jumps in his car, and watches 
the speedometer as he hurries to his work, for which he is paid by the hour, 
day, week, month, or year. 

One day after another, year in and year out, he keeps this up until he 
may finally worry himself to death over a falling stock market, measured 
in dollars, and a rising blood pressure, expressed in points. Then the hour 
of his departure is accurately noted, he is measured for a casket, and the 
time of his funeral is set according to the calendar and the clock. Afterward 
his life span is recorded in the family Bible and carved on his tombstone. 
In the meantime his estate is figured in dollars, and his widow lives the rest 
of her days from the income computed in per cent. 


These common experiences are characteristic of the emphasis placed on 
measurement in the modern world. In fact, if all our various measuring 
devices were suddenly destroyed, contemporary civilization would collapse 
like a house of cards. 


A. Measurement in Science 


Apparently the chief problem of man has always been adjustment. As one 
writer puts it: “The civilization of a race is simply the sum-total of its 
achievements in adjusting itself to its environment.”¹ The form of the
problem has indeed varied somewhat from time to time, and still more has the 
method of meeting it. For ages the ingenuity of man was directed toward 
gaining practical control over the universe about him. At first the process 
was the uncritical procedure of trial and error. This fumbling way early led 
into such blind alleys as alchemy, astrology, and magic. Later the seers 
and wise men began to attempt to put together these scattered bits of ex- 
perience and so, in the words of Omar Khayyám, “To grasp this sorry 
Scheme of Things entire." Thus was born philosophy. The nature of the 
problem had then shifted to understanding the universe, rather than merely 
gaining control over it. 

Scientific method. About three centuries ago there arose, with the 
experimental verification by Galileo of the laws of falling bodies, the 
method of modern science. Since that time man's quantitative conquest of 
nature has expanded not only into all branches of physics and chemistry 
but into biological and psychological phenomena as well. It is no exaggera- 
tion today to assert that science has revolutionized the material world in 
which we live. But it has done more than this; as Whitehead says, science 
has "practically recoloured our mentality."² As a distinguished chemist 
puts it: “Man’s inner and outer necessities, real or imagined, have made him 
both a Scientist and a Philosopher.”³

Both the content and the method of science are important. The content 
of science consists of a continuously expanding body of systematized
knowledge, which is the product of scientific method. The one constant and
universal feature of science is its method of arriving at knowledge. John
Dewey asserts that “the heart of science lies not in conclusions reached,
but in the method of observation, experimentation, and mathematical
reasoning by which conclusions are established.”⁴


¹ Hu Shih, “The Civilization of the East and the West," in Whither Mankind,
edited by Charles A. Beard, page [illegible]. New York: Longmans, Green &
Company, [year illegible].

² Alfred North Whitehead, Science and the Modern World, page 3. New York:
The Macmillan Company, [year illegible].

³ Richard E. Lee, The Backgrounds and Foundations of Modern Science, page
[illegible]. Baltimore: The Williams & Wilkins Company, 1935.

⁴ Thirty-Seventh Yearbook of the National Society for the Study of
Education, page 480. Quoted by permission of the Society. Bloomington,
Illinois: Public School Publishing Company, [year illegible].




What, then, is the scientific method? Bertrand Russell suggests this 
concise formulation: “The essence of the scientific method is the discovery 
of general laws through the study of particular facts."⁵ In another volume 
Russell elaborates this statement:⁶


In arriving at a scientific law there are three stages: the first consists in observing 
the significant facts; the second in arriving at a hypothesis, which, if it is true, would 
account for these facts; the third in deducing from this hypothesis consequences 
which can be tested by observation. 


Conant's definition emphasizes continuity:⁷


Science is an interconnected series of concepts and conceptual schemes that have 
developed as a result of experimentation and observation and are fruitful of further 
experimentation and observations. In this definition the emphasis is on the word 
"fruitful." Science is a speculative enterprise. The validity of a new idea and the 
significance of a new experimental finding are to be measured by the consequences— 
consequences in terms of other ideas and other experiments. Thus conceived, science 
is not a quest for certainty; it is rather a quest which is successful only to the degree 
that it is continuous. 

But what is the role of measurement in scientific method? From Rus- 
sell’s three-stage analysis above, it would appear that, although measure- 
ment has but little, if any, bearing on the second stage in the scientific 
method, it is closely related to the first and third stages. Measurement 
performs a useful function in determining what alleged facts really are 
facts, as well as providing an exact method of describing them. It is also 
indispensable in the final stage of testing and verification, which is usually 
by means of specially devised experiments. A critical treatise⁸ on
educational measurement begins with this statement: “Measurement is the
principal implement of science, changing that field of human endeavor from 
medieval gropings to a modern exactitude.” The relationship is stated by 
Smart in the following words:⁹


Of course, it must not be forgotten that our experience of sense qualities, in per- 
ception, serves as the basis of all scientific endeavor, and that this qualitative aspect 
of things is in varying degrees of completeness assimilated in and through the higher 
categories of the several natural sciences. And this assimilation is effected largely 
through the process of measurement, which thus functions as the connecting link 
between mathematics and the other sciences, and which is only a higher, i.e., more 
precise and complete form of that double-sided process of comparison and discrimi- 
nation which begins on the qualitative level of experience. 


⁵ Bertrand Russell, “Science,” in Whither Mankind, op. cit., page
[illegible].

⁶ Bertrand Russell, The Scientific Outlook, page 57. New York: W. W. Norton &
Company, Inc., 1931.

⁷ James Bryant Conant, Science and Common Sense, pages 25-26. New Haven: Yale
University Press, 1951.

⁸ B. Othanel Smith, Logical Aspects of Educational Measurement, 182 pages.
New York: Columbia University Press, [year illegible].

⁹ Harold R. Smart, The Logic of Science, page 200. New York: D. Appleton and
Company, 1931. Used by permission of D. Appleton-Century Company.




Brief attention will now be given to the relation of measurement to each 
of the principal divisions of science. The discussion will observe the con- 
ventional divisions, namely, the “pure sciences" and the “applied sciences,” 
though we should note that in his presidential address to the American 
Association for the Advancement of Science, Kirtley Mather emphasized 
the shortcomings of these terms:¹⁰


It is significant that when scientists today are philosophizing, they are more 
likely to distinguish between “fundamental research” and “technological develop-
ment” than between “pure science” and “applied science.” The fact is, of course,
that every item of scientific knowledge ever gained, in response to whatever motive, 
has been found sooner or later to have practical significance, either directly or indi- 
rectly, in human affairs. 


Pure science is distinguished from applied science primarily on the basis 
of purpose or motive, and one division of pure science is distinguished from 
another on the basis of subject matter. Although the distinction is not 
always clear-cut, in general it may be said that pure science aims primarily 
at understanding the universe, whereas applied science aims at predicting 
and controlling it. In Russell’s words, “Science, ever since the time of the
Arabs, has had two functions: first, to enable us to know things and, second,
to enable us to do things.”¹¹

He also says: “Science, as its name implies, is primarily knowledge;
by convention it is knowledge of a certain kind. . . . Gradually, however,
the aspect of science as knowledge is being thrust into the background
by the aspect of science as the power of manipulating nature.”¹²

Measurement in the physical sciences. How can the place of meas- 
urement in any particular branch of science best be determined? Perhaps 
chief reliance must be placed upon the testimony of outstanding scientists 
in the particular field and recognized historians of science. Astronomy is 
doubtless the oldest and among the most highly developed of the sciences. 
Although the rise of experimental science is usually dated from Galileo, 
who lived about 300 years ago, Boring describes two important experiments 
in astronomy which were made as far back as 2,200 years ago. In com-
menting upon these early experiments, Boring¹³ says:


It is no mere accident that these first two important astronomical experiments 
made use of mathematics in the interest of measurement. Measurement provides a 
precision of differentiation and definition in observation that can be had in no other 
way; mathematics provides the necessary means of carrying measurements through 
a logical development to their consequences without loss of their precision. 


10 Kirtley F. Mather, “The Common Ground of Science and Politics,” Science, 117:
169-174, February 20, 1953.

11 Bertrand Russell, The Impact of Science on Society, page 21. New York: Simon
and Schuster.

12 Bertrand Russell, Dictionary of Mind, Matter, and Morals, page 227. New York:
Philosophical Library, 1952.

13 Edwin G. Boring, A History of Experimental Psychology, pages 14-15. New York:
The Century Company, 1929. Used by permission of Appleton-Century-Crofts, Inc.




The bearing of this point upon the development of science is thus stated 
by Westaway:¹⁴


The more that exact measurement enters into any branch of Science, the more 
highly is that branch developed. It is for this reason that Chemistry and Physics 
are so far in advance of Botany and Geology. And the reason why we can obtain 
so much clearer notions of, for instance, an area or a weight, than of, say, wisdom 
or chivalry, is because the former are measurable, the latter not. It is of the first 
importance in Science that we should, whenever possible, obtain precise quantita- 
tive statements of phenomena, and thus we see why it is that the introduction of a 
new scientific instrument so often leads to a marked advance in our knowledge. 


Of the physical sciences, physics is usually regarded as the most highly
developed at the present time.¹⁵ Two outstanding figures in the develop-
ment of modern physics were Lord Kelvin in England and Max Planck
in Germany. Regarding the place of mathematics and measurement in
physics, Lord Kelvin says:¹⁶


When you can measure what you are speaking about, and can express it in num- 
bers, you know something about it, and when you cannot measure it, when you 
cannot express it in numbers, your knowledge is of a meagre and unsatisfactory 
kind. It may be the beginning of knowledge, but you have scarcely in your thought 
advanced to the stage of science. 


Max Planck, regarded as one of the leading exponents of the important 
quantum theory in physics, declares that further progress in the physical 
sciences “will depend essentially on the development and wider application 
of our methods of measurement.”¹⁷ The case for measurement in physics
may very well rest upon the testimony of these expert witnesses, although 
practically every prominent physicist from Galileo and Kepler to Einstein 
could be called. 

Measurement in the biological sciences. The history of the various 
branches of science indicates clearly that not only are the biological sciences 
younger than the physical sciences, but also that measurement and math-
ematics have occupied and still occupy a less prominent place in the bio-
logical sciences. The primary reason for this is doubtless the greater sim-
plicity of the data in the physical sciences.¹⁸


14 F. W. Westaway, Scientific Method: Its Philosophical Basis and Its Modes of Appli-
cation, pages 271-272. New York: Hillman-Curl, Inc., 1937.

15 Silvio Fiala, “The Experiment and Its Role in the Theory of Knowledge,” Philos-
ophy of Science, 18: 253-258, July, 1951. “The highest level in the sense of most precise
correspondence between its definitions achieved up to now is in physics, perhaps the
oldest in natural science. Physics starts from the idea of the experiment as the only
means by which this correspondence might be achieved.” Page 254.

16 Quoted by Ronold King, “Physics, Metaphysics and Common Sense,” Scientific
Monthly, 42: 311, April, 1936.

17 Max Planck, Where Is Science Going?, page 96. New York: W. W. Norton & Com-
pany, Inc., 1932.

18 Julian Huxley makes this statement: “Sciences, like Empires, have their rise and
their time of flourishing, though not their decay. Naturally, the order of their rise runs 
parallel with the complexity of their subject-matter. The physical sciences, being the 
simplest and most straightforward, were the first to start their triumphant career.” 
What Dare I Think? page 1. New York: Harper & Brothers, 1931. 




Certainly, however, biology has progressed far beyond the stage of 
authority that existed in the Middle Ages, when, for example, the question 
of the number of teeth possessed by the horse was the subject of heated 
debate in many contentious writings. “Apparently,” says Locy, “none of
the contestants thought of the simple expedient of counting them, but
tried only to sustain their position by reference to authority.”¹⁹ Locy rec-
ognizes three somewhat overlapping phases, or stages, in the development
of the biological sciences: namely, the descriptive, the comparative, and
the experimental.²⁰

Without doubt, one of the most important generalizations of modern 
science is that of evolution through natural selection as set forth by Charles 
Darwin near the middle of the last century, and yet it will be remembered 
that his work was nonmathematical, consisting largely of the classification 
of vast amounts of data upon which these epoch-making generalizations 
were based. In fact, as far as method goes, it was largely an extension and 
refinement of that emphasized by Aristotle more than two thousand years 
earlier. 

Whitehead attributes the retardation of science in the Middle Ages 
largely to Aristotle’s emphasis on classification rather than measurement. 
Note this statement:²¹


But the biological sciences, then and till our own time, have been overwhelmingly 
classificatory. If . . . only the schoolmen had measured instead of classifying, how 
much they might have learnt! 


That Darwin’s cousin, Sir Francis Galton, took this position is clearly 
indicated by the following statement: “Until the phenomena of any branch
of knowledge have been subjected to measurement and number, it cannot
assume the status and dignity of a science.”²² Galton, therefore, proceeded
to introduce exact measurement and mathematical calculation into the
theory of evolution. In later years these biometrical methods have been
greatly extended by Karl Pearson, Spearman, Fisher, and others. The re-
markable experiments of Mendel in heredity appeared at about the same
time, although their value was not recognized until about 1900. Because of
these pioneers, knowledge of heredity has become established on a definite
mathematical basis.²³ Even earlier than Mendel’s time such noted physiol-
ogists as Müller, Weber, and Helmholtz had been doing much the same
for physiology.²⁴ After a survey of the development of natural science from


19 William A. Locy, Biology and Its Makers (Third Edition). New
York: Henry Holt & Company, 1922.

20 Ibid., page 443.

21 Alfred North Whitehead, op. cit., page 43.

22 Sir Francis Galton, quoted by I. W. Howerth on page 1 of “Measurement of Mental 
d rye S Mie аран 15: 1-9, June 2, 1932. 

23 Leslie C. Dunn (Editor), Genetics in the 20th Century; Essays on the Progress of
Genetics during the First 50 Years, 634 pages. New York: The Macmillan Company, 1951.

24 William A. Locy, op. cit., page 192.




Aristotle to Fabre, which shows a definite trend from qualitative to quan- 
titative analysis, Peattie concludes: “In short, what science calls for today 
are life histories, and ecological studies—the precise measurement of the 
environmental factors and the inter-relations of organisms.”²⁵

The various biological sciences have not been placed upon as definitely 
a quantitative basis as have physics and chemistry, largely because of the 
nature of their data. Some competent students of science think the biologi-
cal sciences have moved too far in that direction. Whitehead,²⁶ for example,
expresses regret that “biology apes the manners of physics,” while at the
same time neglecting the unique character of its own subject matter, or-
ganisms, which are incapable of analysis without the destruction of their
essential nature. The Gestalt school of psychology in recent years has also
registered a vigorous protest against the atomic conceptions of mind which
experimental psychology took over from nineteenth-century physics.²⁷

Measurement in the social sciences. Measurement in the social sci- 
ences presents a difficult problem. 'The social sciences are not only newer 
than the natural sciences but their data are more complex. They study 
human beings, the most complex of all biological organisms, and their 
social relationships, which are far more complicated than purely individual 
responses. 

The genetic history of the social sciences has been described as follows:²⁸


In the days of Aristotle, Plato, and Pythagoras, philosophy still embraced the 
exact, natural, and social sciences. At the beginning of the nineteenth century 
the exact and natural sciences—mathematics, astronomy, physics, chemistry, geol-
ogy, biology—had already left their philosophical matrix and were rapidly develop-
ing their own methods and techniques, while preserving a tendency to return to
philosophy for an occasional theoretical and speculative rehauling. But the social
sciences—history, ethics, law, economics, psychology, religion, esthetics, anthro- 
pology (such as it was)—were still rocking in the metaphysical cradle of Mother 
Philosophy. One by one the babes emerged and learned to stand on their own feet 
and to talk their own language, even though their gait and vocabulary continued 
for a long time to bear traces of their maternal heritage. 


In the preface to The Seven Seals of Science,²⁹ Mayer states his major
thesis as follows: 


The central theme of the essay is that the sciences did not arise and could not 
have arisen simultaneously, that they form a well defined structure with mathe- 
matics at the bottom, that each later science built upon those that went before, that 
psychology is only now in process of becoming established, and that the social 


25 Donald Culross Peattie, Green Laurels, page 345. New York: Simon and Schuster,
Inc., 1936.

26 Alfred North Whitehead, op. cit., page 150.

27 David Katz, Gestalt Psychology, Its Nature and Significance (Translated by Robert
Tyson), 175 pages. New York: Ronald Press Company, 1950.

28 William F. Ogburn and Alexander Goldenweiser, The Social Sciences and Their
Interrelations, pages 2-3. Boston: Houghton Mifflin Company, 1927.

29 Joseph R. Mayer, The Seven Seals of Science, page vii. New York: The Century
Company, 1927. Used by permission of D. Appleton-Century Company. 




studies, if they are to be worthy of the name of science, must build upon the natural 
sciences and particularly upon geology, biology, and psychology. 


It is significant that the author, although a professor of economics and
sociology, uses as the title of the final chapter of the book, “Social Science
in the Making.”

Other writers take a somewhat more optimistic position, and many of
them indicate specifically the direction the development of social science
is taking and must take. For example, Ogburn and Goldenweiser³⁰ write
as follows:


Attention, finally, must be drawn to the increasing importance of statistical 
methods in the social sciences. . . . The extent to which social thought and theory 
will pass from the sphere of opinion, conjecture, and contemplative analysis to that 
of fact, knowledge, and control, will depend on their permeation by these scientific 
methods of measurement and statistics. 


Of course, there is nothing particularly new about this viewpoint. As 
early as 1798 Malthus attempted to put economics on a definite mathe-
matical basis when he announced his celebrated, although erroneous, prop-
osition that “population increases in a geometrical ratio, food in an arith-
metical ratio.” A little later, Quetelet showed that the theory of probability
could be applied to human problems such as insurance.³¹

Barnes traces the history of sociology and concludes: “There is a general 
agreement that sociology can become a true science of society only in the 
degree to which it is able to appropriate and apply those exact methods of 
measurement and analysis which constitute the indispensable attributes 
of science in general.”³² On the other hand, Ellwood, an eminent sociologist,
takes a wholly different position. His point of view is clearly stated in these
words:³³


It would seem to me that as we ascend in the scale of life the view that science is 
quantitative measurement of objective conditions becomes less and less applicable, 
not only because measurement becomes more difficult, but because the subjective 
element plays a larger part. Even if the subjective element is capable of certain 
measurements and even if it is true that whatever exists exists in some quantity 
or number, nevertheless, it is obvious that where subjective elements play a large 
part, measurement becomes of less importance for accurate knowledge because it is 
confined to the superficial aspects of the total situation and fails to expose the nature 
of the process which is being investigated. This is especially true in the social sci-
ences and in them measurement seems to me to play a rôle secondary to other
scientific methods.


30 William F. Ogburn and Alexander Goldenweiser, op. cit., pages 8-9.

31 Helen M. Walker, Studies in the History of Statistical Method, pages 39-42. Balti-
more: The Williams & Wilkins Company, 1929.

32 Harry Elmer Barnes, “The Development of Sociology,” Scientific Monthly, 35: 547,
December, 1932.

33 Charles A. Ellwood, “The Uses and Limitations of the Statistical Method in the
Social Sciences,” Scientific Monthly, 37: 353-357, October, 1933. Page 353.




It seems fairly clear, therefore, that measurement and statistical analysis
of quantitative data do occupy a prominent place in the social studies, 
although there is no general agreement as to just what this place is. There 
does appear, however, to be universal recognition that the problems are 
more difficult than those presented by the earlier sciences, and that their 
solution must be based at least in part on these other sciences, notably 
psychology. Measurement in psychology will be considered at some length 
in later sections. 

Measurement in the applied sciences. It has already been pointed 
out that the distinction between pure and applied science is not always 
easy to draw. In the beginning, science appears to have arisen in the 
service of certain basic human needs and desires.³⁴ Despite its humble
origin, however, science soon ceased to be but a means to an end and 
became an end in itself. For about a century and a half following Galileo 
it became exclusively the pursuit of the learned, and hardly influenced the 
thoughts or habits of ordinary men at all. The emphasis had shifted from 
applied to pure science. Russell comments upon this fact as follows:³⁵


It is only during the last hundred and fifty years that science has become an 
important factor in determining the everyday life of everyday people. In that 
short time it has caused greater changes than had occurred since the days of the 
ancient Egyptians. One hundred and fifty years of science has proved more explo- 
sive than five thousand years of prescientific culture. 


The cycle is now complete. Science, which arose from man's stern neces- 
sity for meeting the ordinary problems of life, has now returned to serve 
again his practical needs. And nowhere is measurement more in evidence 
than in its practical applications. A competent observer³⁶ asserts that “one
can hardly think of a field of intellectual endeavor into which measurement
has not crept, and surely there is none in which its influence has not been
felt.”

Indeed, it is these practical applications of science that have impressed 
the mind of the layman in recent years. When he hears the word “science,” 
he is likely to think of the results of science, possibly some invention such 
as the radio or radar, rather than of the physics and chemistry that have 
made them possible. Probably a thousand persons would know of Marconi, 
who invented wireless telegraphy, to one who ever heard of Hertz or Max- 
well, whose pioneer work blazed the trail. 

The prominent place of quantitative measurement in these modern 
applications of science to engineering and industry is too well known to re- 
quire elaboration. Take a modern automobile as an example. Its mechanical 
parts are accurate to the thousandth part of an inch. Every detail has been 

34 Cf. John Dewey, How We Think, page 216. Boston: D. C. Heath & Company, 1933.

35 Bertrand Russell, The Scientific Outlook, op. cit., pages vii-viii.

36 B. Othanel Smith, op. cit., page 2.




subjected both to careful laboratory experimentation and to rigid tests 
on the trial grounds. The instrument board presents, as practical aids to 
the user, various devices for measuring gasoline, electric current, oil pres- 
sure, temperature, and speed of car, as well as perhaps a clock and a radio. 

It is instructive to study what the application of science has done for 
modern cookery. The ordinary untrained housewife still uses recipes with 
such vague directions as “season to taste,” “add butter to the size of a
walnut,” “cook in a moderate oven,” and so on. In contrast, the modern 
bakery accurately measures all ingredients, mixes them uniformly for a 
specified length of time, and then cooks them at a specified temperature 
for a definite time. This assures a predictable uniformity in the product, 
in contrast with the “luck” of the old-time cook. 

Medicine is an outstanding field in which many discoveries of pure sci- 
ence have been applied to the solution of practical human problems. 
Herrick³⁷ explains how the development and use of various instruments of
precision have revolutionized medical diagnosis and practice. The measure- 
ment of blood pressure, body metabolism, and the physical and chemical 
analyses of the blood and other body fluids are as recognized techniques 
today as were height, weight, and temperature a generation ago. The die- 
titian in the kitchen measures the patient’s food from the standpoint of 
calories, minerals, and vitamin content with an accuracy approaching that 
of the pharmacist in compounding his medicines and of the nurse in ad- 
ministering them. 

Limitations of measurement. Before concluding this discussion of 
the relation between measurement and science, it may be well to note some 
of the difficulties and problems of measurement. It must not be assumed 
that the tools and techniques of measurement have been developed to a 
state of perfection. This is far from true even in physics and chemistry,
where measurement has progressed furthest. 

Planck, an eminent physicist, offers the warning that “every number 
obtained by physical measurements is liable to a certain possible error.”³⁸
Westaway puts the matter in these words: “We may, in fact, look upon
the existence of error in all measurements as the normal state of things.”³⁹
Doubtless, Bertrand Russell has the same idea in mind when he describes
science as a “succession of approximations.”⁴⁰

In general, it may be said that errors in measurement are
due to imperfections either in the measuring instruments themselves
or in the manner in which they are employed. While both of these sources
of error are subject to a considerable measure of control, neither can be 

37 James B. Herrick, “Changes in Internal Medicine Since 1900,” Journal of American
Medical Association, 105: 1312-1315, October 26, 1935.

38 Max Planck, A Survey of Physics, page 92. New York: E. P. Dutton & Co., Inc.,

39 F. W. Westaway, op. cit., pages 289-290.

40 Bertrand Russell, The Scientific Outlook, op. cit., page 65.




eliminated altogether. Three methods of controlling errors in measurement 
may be suggested: 


1. The improvement of existing measuring instruments.

2. The devising of adequate methods of estimating or allowing for errors.

3. The development of skill in applying the instruments of measurement
so as to reduce errors to a minimum and in interpreting the results so as to
take due account of the errors which cannot be eliminated.


The first of these methods will be considered at some length in Chapters 
4 to 7, the second will receive attention in Chapter 3, while practically 
the entire book is concerned with the third. 

The limitations of existing measuring instruments do not detract from 
the importance of measurement, although they do add to its difficulty.
The result is rather to set a special premium upon the skillful use of these 
instruments. As a rule, the cruder any tool is, the greater the skill required 
in its application, if satisfactory results are to be obtained. The early auto- 
mobile, for example, called for much greater skill in its successful operation 
than does its more highly perfected modern successor.

Conclusions. What, then, is the relation between measurement and 
science? A few generalizations seem fairly clear: 


1. There is a direct relationship between the status of a science and the 
degree to which measurement has been developed in it. In the older and 
better established physical sciences, measurement occupies a fundamen- 
tal place; in the newer biological sciences, measurement occupies a less 
important place; and in the social sciences, the most recent group, meas-
urement has made hardly more than a beginning. The evidence seems
abundantly to support Westaway’s statement: “The more that exact
measurement enters into any branch of Science, the more highly is that
branch developed.”⁴¹

2. The prominence of measurement in a science appears to be roughly 
in inverse ratio to the complexity of its subject matter. Inert material
seems inherently more susceptible to measurement than living organisms.
Apparently the maximum difficulty comes in the case of man, particularly 
in his social behavior. 

3. All measurement is subject to errors. These errors are due to limita-
tions in the tools as well as in the techniques of measurement. To ensure
satisfactory results, the greater the limitations of the former, the greater
the skill and insight called for in the latter.


B. Measurement in Education 


Is education a science? Education is described sometimes as a phi-
losophy, sometimes as a science, and sometimes as an art. Moreover, it is


41 F. W. Westaway, op. cit., pages 271-272.




commonly assumed that these terms can be clearly distinguished from each 
other or even that there is a certain antagonism among them. Quite the 
contrary is the truth, however. They are most closely interrelated. As the 
role of measurement in education will doubtless be different if education 
is a science from what it will be if education is a philosophy or an art, 
some attention must now be given to the problem of the relationship of 
science to philosophy on the one hand and to art on the other. Benjamin’s 
comments are pertinent here:⁴²


It is commonly recognized that the most significant difference between philos- 
ophy, on the one hand, and the more specialized studies, such as science, art, 
religion, education, and politics, on the other, is that philosophy is concerned with 
the critical examination of certain concepts and beliefs that are presupposed by
these more specialized studies. For example, the average scientist readily admits 
that, although his study is based upon certain beliefs in the uniformity of nature, 
the rationality of the world, and the relative attainability of truth, he is not himself, 
as a scientist, called upon to subject these notions to critical examination. He can 
argue either that such justification is not ordinarily required for the actual carrying 
on of science, or that if such justification is demanded it can readily be provided 
by the philosopher whose job is, by common consent, precisely the examination of 
the assumed notions of science. 


Perhaps the kinship of science and philosophy will be more apparent 
if approached genetically and historically. The early Greek thinkers, Plato 
and Aristotle for example, recognized no distinction between science and 
philosophy, both being joined by the common bond of love, the pure love 
of truth.⁴³ During the Middle Ages, however, this once harmonious family
found itself unequally and somewhat unhappily yoked together. Since the 
later Renaissance, one after another of the children born to this union, 
beginning with physics and chemistry, left the family tree and set up in 
business for themselves. This was a sort of “psychological weaning,” which 
doubtless performed a useful function for the time being. But, as often 
happens when discontented youth desires to assert its independence, science 
rebelled altogether and would have nothing at all to do with the parental 
wisdom of Mother Philosophy. This prolonged period of arrogant adoles-
cence has continued to the present time. Fortunately, in recent years some
of the wiser heads have somewhat patched up the family quarrel which has
brought unhappiness to mother and daughters alike.⁴⁴ Gray describes the
result in education:⁴⁵


42 A. Cornelius Benjamin, “Science and Its Presuppositions,” Scientific Monthly,
73: 150-153, September, 1951. Page 150.

43 Harold R. Smart, op. cit., Chapter 1.

44 For an excellent general discussion of this point of view, see: Max Black, “A Lend-
Lease Program for Philosophy and Science,” Scientific Monthly, 61: 165-172, September,
1945. For an admirable statement of the dangers of a narrow conception of a science of
education, see: William A. Brownell, “Quantitative Research on Teaching and Learn-
ing,” School and Society, 50: 847-856, December 30, 1939.

45 J. Stanley Gray, “A Neglected Phase of Educational Research,” Journal of Educa-
tional Research, 29: 89-90, October, 1935.




Educators are so over-zealous to become “scientific” and objective that they have 
ignored the fact that there is no less need for thinking in science than in philosophy. 
When research procedure neglects theoretical evaluation and interpretation, it is 
only partly scientific.


Although competent students of education recognize that both philoso-
phy and science are necessary for a complete act of thinking, each field is
so broad as to make a certain division of labor necessary. The situation has
been well stated as follows:⁴⁶


If one specializes in the critical examination of educational theories, hypotheses, 
and generalizations in the light of data which are already available, we call him an 
educational philosopher. If one specializes in the solving of educational problems 
by making new appeals to experience through systematic, controlled and uncon- 
trolled observation, in field or laboratory, we call him an educational scientist in 
the classical sense of the term. 


But even the most competent philosopher is forced to take his science 
at second hand from the data made available by specialists in science, for 
it is too much to expect him at the same time to be an expert experimen-
talist.⁴⁷ Conversely, a competent scientist is forced to take his philosophy
largely at second hand while he conducts his persistent search for new facts. 
Moreover, not only does a scientist have to borrow his philosophy, but 
much of his science also. 

This borrowing, though necessary, is risky business not only for the 
philosopher but for the scientist as well. Not only may he be unfortunate 
in what he borrows, but its meaning to him is inevitably colored by the 
mental background imposed by his specialty. The important point is that 
science and philosophy are reciprocally related, as inseparably linked as are
heredity and environment in the growth of a living organism. Buckingham
states the relationship concisely: “As fields of human endeavor science and
philosophy supplement each other. . . . Without philosophy, science is
incomplete; without science, philosophy is barren.”⁴⁸ An educational phi-
losopher puts the relationship as follows: “While philosophy must be the
general to plan the grand strategy of education, it will need science as its
staff officer.”⁴⁹ Broadly conceived, education then is both science and phi-
losophy. As a science it belongs to the group known as the social sciences,
whose data are the most complex of all. While education is not, and doubt- 
less never will be, as thoroughgoing a science as physics or chemistry, it has 
nevertheless made more progress toward the scientific treatment of its 


46 Carter V. Good, A. S. Barr, and Douglas E. Scates, The Methodology of Educational
Research, page 24. New York: D. Appleton-Century Company, 1936.

47 In commenting upon William James, Boring makes this disturbing observation:
“It is too bad, but no one has ever yet succeeded in being both a good philosopher and
a good experimentalist.” E. G. Boring, op. cit., page 502.

48 B. R. Buckingham, “The Philosophy and Organization of Research,” School and
Society, 29: 758, June 15, 1929.

49 John S. Brubacher, Modern Philosophies of Education, page 87. New York: McGraw-
Hill Book Company, Inc., 1939.




problems during the twentieth century than in all the centuries preceding. 

But what of art in relation to science and philosophy? Will Durant puts 
the matter clearly and graphically:⁵⁰

Every science begins as philosophy and ends as art; it arises in hypotheses and
flows into achievement. Philosophy is . . . the front trench in the siege of truth.
Science is the captured territory; and behind it are those secure regions in which 
knowledge and art build our imperfect and marvelous world. 

William H. Payne long ago suggested that “in the slow but sure evolution
of human opinion, a science of education is beginning to emerge from the
art of education.”⁵¹ But there is also a reciprocal relationship, for as
Doughton⁵² points out, “the sure foundation of an effective teaching art is
a science of education." 

Science consists in knowing, while art refers to doing and implies skill and 
aesthetic excellence. An outstanding teacher might be either an artist or 
a scientist, although the ideal teacher must be something of both. No mat- 
ter how great the artist or with how much inspiration he wields the brush, 
the pigments are mixed according to formula. 

The place of measurement in education. What, then, is the rightful 
place of measurement in education, which is at once a science, a philosophy, 
and an art? The answer varies somewhat with the point of view of the 
observer. Naturally the role of measurement appears more important to 
those educators whose specialty is science than to those whose specialty is 
philosophy. At times these views become so divergent as to appear wholly 
irreconcilable. The quotations at the top of the next page, from outstand- 
ing early educational leaders in a single institution, will make this clear. 

It is always dangerous, of course, to detach a statement from its context. 
However, the strong language employed, containing such terms as
“thoroughly,” “indispensable,” “final,” “fallacy,” “never,” and “always,”
scarcely leaves room to hope that these statements are entirely harmonious. 
Doubtless, any attempt to interpret the above quotations should take into 
consideration the publication dates. It will be noted that views of the 
educational scientists represent an earlier period. They appeared soon after 
World War I, when the success of the Army Alpha test was still fresh in the
minds of psychologists. Moreover, at that time, atomic physics still
dominated all science, including psychology, and behaviorism was in the
ascendancy in America. The relatively more recent quotations from
educational philosophers reflect a newer atmosphere. The recognition of the
principle of indeterminacy in physics has had a sobering effect on science
in general, and followers of the Gestalt school of psychology in particular 


50 Will Durant, The Story of Philosophy, pages 2-3. New York: Simon and Schuster.

51 William H. Payne, Contributions to the Science of Education, page 11. New York.

52 Isaac Doughton, Modern Public Education, Its Philosophy and Background, page
308. New York: D. Appleton-Century Company, 1935.




TWO VIEWS OF MEASUREMENT IN EDUCATION


EDUCATIONAL SCIENTIST

Whatever exists at all exists in some amount. To know it thoroughly involves
knowing its quantity as well as its quality.⁵³

Measurement is indispensable to the growth of scientific education. . . . The
final answer to every educational question, except one, must be left to the
educational measurer and must await the development of education as a
science.⁵⁵


EDUCATIONAL PHILOSOPHER

Yet another false concept in the climate gripping the American scholar thwarts
his study of Man. This is the fallacy that “I know only what I can describe
quantitatively . . . whatever exists, exists in some measurable amount.”⁵⁴

And I should myself like further to conclude that education can never become
a science . . . always—so long as this world stands—will there be problems,
nay regions of problems, with which the processes of “exact” science are
insufficient to cope.⁵⁶


have protested against what they regard as the atomic conception of mind. 
It is, therefore, quite possible that the present views of the two groups are
much closer together than the above statements would indicate. However,
the following statement from McCall strongly suggests that not all
differences have been ironed out:⁵⁷


Certain extreme exponents of the organismic (often called Gestalt) view contend
that any organism is more than the sum of its parts, and that adding test scores is 
like trying to make a man by sticking together a head, a trunk, two arms and two 
legs. But a reading score cannot be properly compared to one leg. It is not a broken 
off fragment of the mind. In a very real sense, a reading score tends to measure the 
entire organism functioning in that reading situation. 

Mental measurements are essentially similar to bodily measurements. If anyone 
proposed to abolish the making and use of the measurements of pulse, temperature,
blood pressure, et cetera, we would call him crazy, and if anyone proposes to abolish 
the making and use of mental measurements, he, too, should be called—I hesitate 
to say what, since somehow I must manage to live with certain of my colleagues 
after this is published, but surely something other than an organismic philosopher
or a Gestalt psychologist! 


Both scientists and philosophers attempt to test their generalizations 
before finally accepting them. With science the process is the straight- 
forward one of subjecting all such generalizations to rigid mathematical or 


53 Edward L. Thorndike, Seventeenth Yearbook of the National Society for the Study of
Education, Part II, page 16, 1918.

54 Harold Rugg, “The American Scholar Faces a Social Crisis,” The Social Frontier,
1: 12, March, 1935.

55 William A. McCall, How to Measure in Education, pages 7, 9. New York: The
Macmillan Company, 1923.

56 William H. Kilpatrick, School and Society, 30: 48, July 13, 1929.

57 The Test Newsletter, published by the Bureau of Publications, Teachers College,
Columbia University, December, 1936.




experimental verification. With philosophy the process appears more
involved. Kilpatrick,⁵⁸ for example, recognizes two distinct situations:
“simple prophecies” and “decisions on appropriate conduct or policy.” For the
former, he agrees that “‘verification’ is an appropriate term and
measurement (when available) is a proper means of testing.” For the latter,
however, he insists that “‘verification’ is not an appropriate term and
techniques of measurement are not in themselves adequate.” He continues:
“In such cases the function of measurement is not to supplant or to supply
decisions, but to furnish, regarding the working of the policy under review, 
more and better data, in the light of which a fresh and better decision can 
be made.” Apparently, then, whenever actual verification is possible, this 
philosopher at any rate is willing to assign to measurement the job of doing 
it; and even in the other cases he assigns to it the necessary, if humble, 
duty of providing at least part of the data required. Perhaps, then, it is 
not an unfair statement to say that the scientist always assigns to measure- 
ment a fundamental role, whereas the philosopher sometimes does so. How- 
ever, at all times the philosopher seems willing to ascribe to measurement 
an important, even if not a fundamental, place in education. 

Take the important matter of guidance, for example. Even its most 
enthusiastic supporter would hardly characterize guidance as a full-fledged 
science. Yet measurement provides some of the essential data in any sound 
guidance program. To describe a pupil as of weak scholarship and of low 
mentality is to leave his status vague and unsatisfactory. But to say that 
he has a percentile rank of 20 on the California Achievement Tests and 
an IQ of 84 on the Revised Stanford-Binet Intelligence Scale is to describe 
him in reasonably precise and meaningful terms. 
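
The arithmetic behind a statement such as "a percentile rank of 20" can be sketched briefly. The short Python example below is only an illustration: the norm-group scores are hypothetical, and the convention of counting half of the tied scores as falling below the pupil is one common choice, assumed here rather than taken from any particular test manual.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`.

    Assumed convention: scores tied with `score` count as half below.
    """
    if not norm_scores:
        raise ValueError("norm group must contain at least one score")
    below = sum(1 for s in norm_scores if s < score)
    tied = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * tied) / len(norm_scores)


# Hypothetical raw scores for a norm group of ten pupils.
norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]

print(percentile_rank(18, norm_group))  # prints 25.0
```

A pupil whose raw score is 18 thus stands above roughly one fourth of this hypothetical group. The description is more precise than "weak scholarship," although its meaning still depends upon the group with which the pupil is compared.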

In this connection it is well to observe that measurement is always a
means to an end, and never an end in itself. A measurement is simply 
a quantitative description of observed data. The significance or educational 
implications of the measurement are rarely self-evident or automatic. As a 
rule, the true significance of the measurement can be determined only when 
it is seen in relation to other relevant factors and is fitted into the total 
pattern of the situation. The term evaluation, as distinguished from meas- 
urement, is often used to refer to the process of appraising the “whole” 
child or the "entire" educational situation. 

The three R’s in education. Everyone is familiar with the famous 
trinity of R’s in education, “Readin’, ’Ritin’, and ’Rithmetic.” These,
of course, have to do with the content of education, the curriculum of 
instruction. There is also another series of R’s which is concerned with the 
process of arriving at, or at least of searching for, truth in education to 
serve as a basis for theory and practice. Educators have, in general,
employed three principal methods of settling educational issues and of arriving


58 William H. Kilpatrick, “The Relation of Philosophy to Scientific Research,” Journal
of Educational Research, 24: 110-114, September, 1931.




at educational principles and policies. These constitute a new series of R’s, 
“Rhetoric, Reputation, and Research.” 

Historically the first of these methods may be termed that of Rhetoric. 
It is the method par excellence of politicians, although not unknown in 
education, especially among the reformers of every period. The method is 
usually most dangerous when used orally. It is too well known to require 
detailed discussion here. Abe Martin’s famous definition of an orator as a 
“public speaker not unduly hampered by the facts” indicates rhetoric’s 
limitations. The danger is that the personality of the speaker may outweigh 
the merits of the case, and the artistic form of the speech may have more 
influence than its content.⁵⁹ Naturally measurement and quantitative data
are usually irrelevant, if not positively in the way. As a matter of fact, 
a Speaker of the House of Representatives once attributed the decline in 
oratory chiefly to the “general diffusion in knowledge,” since “as a rule
the more information a man has the less emotional he is, and the orator’s
appeal was to the emotions far more than to the understanding.”⁶⁰

The second method of determining educational theory and practice may
be termed that of Reputation. According to it, the settlement of an educa- 
tional issue is the simple matter of finding out what has been said on the 
question by some persons whose reputations in the field are sufficiently 
great to make them accepted as authorities. This method has been the 
dominant one in education until recently and is still widely used. It has a 
legitimate and necessary place in education as it does in law and medicine, 
where one must rely for the solution of many practical problems upon the 
professional judgment of acceptable authorities. But the method is not 
without its dangers, which are so important as to warrant brief discussion. 

In the first place, the authority may be mistaken. Reputation is un- 
fortunately no guarantee of reliability. Until comparatively modern times 
the wisest persons were quite certain that the sun each day made a com- 
plete journey around our flat earth. Such divergent views on practically 
every phase of education as are expressed in current educational journals 
and in public addresses by our most eminent educational leaders are ample 
assurance that men of the highest reputation may be mistaken. In the 
second place, the authority may be misquoted. A few years ago at a meeting 
of the American Psychological Association a speaker quoted what pur- 


59 Irvin S. Cobb described a prominent Southern orator as one who can “make a song 
of a syllable and turn any reasonably long word into an anthem.” The Courier Journal, 
Louisville, Kentucky, June 16, 1936. 

60 Champ Clark, “Is Congressional Oratory a Lost Art?” Century, 81: 310, December,
1910.

61 Educators, of course, have no monopoly here. Bertrand Russell reports the amusing
but instructive example of Todhunter, the mathematician, who opposed the establishing 
of the first experimental laboratory at Cambridge, because he thought it was unnecessary 
for students to see experiments performed, since the results could be vouched for by 
their teachers, all of whom were men of the highest character, including many who were 
clergymen in the Church of England! 




ported to be a statement from an outstanding psychologist. At the conclu- 
sion of the address, this psychologist arose to explain that he had never 
made any such statement and in fact believed quite the contrary. It is too 
much to expect, however, that the authority will always be present to 
correct his alleged quotations. A third danger is that conditions may have 
changed so greatly that a statement once true may no longer be applicable.
For example, George Washington’s warning against foreign entanglements, 
made in 1797, when the United States consisted of 16 states whose total 
population, barely 5,000,000, was separated from Europe by a broad 
Atlantic, need not be at all applicable to a nation of 48 states, with a 
population of more than 150,000,000, joined to Europe by modern agencies 
of communication. The method is thus seen to be beset by many dangers. 
It must, therefore, be used with caution. The necessity of extreme care in 
the selection of the authority cannot be overemphasized. It is usually wise, 
also, to examine the evidence that lies behind the statement and the con- 
ditions under which it was made. Reputation alone must not be thought of 
as adequate assurance of reliability. At best the reliance upon reputation 
involves considerable risk. It is always well to consider the circumstances 
under which the statement was made as well as the data upon which it was 
based. 

The third method of arriving at truth is that of Research. This method 
is comparatively recent in the history of man. The prestige of the orator 
and rhetorician in ancient Greece and Rome and the authority of Aristotle 
in the Middle Ages testify to the newness of the method of research. And it 
is more recent in education and the other social sciences than in the phys- 
ical and biological sciences. Its appeal is to the intellect and is based upon 
the facts in the case. It is the distinctive method of science and may be 
regarded as the only final method of settling an educational issue. A single
illustration will indicate its superiority over the earlier methods used.

A practical problem in education is to determine the proper amount of 
time to be devoted to each subject taught. Educators have usually assumed 
that the results obtained are directly proportional to the amount of time 
expended. In fact, the thing seemed self-evident. In the closing years of the 
last century, however, an inquisitive American physician by the name of 
Rice undertook, apparently for the first time, to subject the question to 
scientific study. The subject chosen was spelling, and the procedure was 
extremely simple and direct. A uniform, although not standardized, spelling 
test was administered to schools in various parts of the country. Afterward 
the results, involving about 100,000 cases, were tabulated according to the 
amount of time devoted to spelling in the school program. Contrary to the 
usual assumption, Rice found little or no relation between the results 


62 Someone has suggested that theories often continue to live long after their brains
have been knocked out.




obtained and the time expended.⁶³ Equally good spelling achievement was
found in schools where a period of ten or fifteen minutes was devoted to the 
subject as in those where a period three or four times as long was allowed. 
Although considerable skepticism was manifested toward the Rice inquiry 
at the beginning, the evidence was so convincing as to compel assent. 
Today few schools allot more than fifteen minutes a day to spelling in the 
school program. The solution of practical problems in education by the
method of research had thus made a promising beginning. 

Forty-one years later Tyler summarized the educational situation as 
follows:⁶⁴


The proceedings of educational associations during the latter part of the nine- 
teenth century indicate clearly an attempt to settle teaching problems by argument, 
by impassioned pleas, or by consensus. The achievement-testing movement pro- 
vided a new tool by which educational problems could be studied systematically 
in terms of more objective evidence regarding the effects produced in pupils. The
hope that problems could be settled by reference to fact rather than subjective 
impression or emotionally colored opinions has probably been the strongest influ- 
ence of the achievement-testing movement in the past forty-one years. 


Good, Barr, and Scates⁶⁵ classify research methods in education under
four headings: historical, normative-survey, experimental, and other
methods. Of these four methods only the historical is not dependent upon
measurement in some form, and even this method is likely to make use of
numerical data. 

The function of measurement in instruction and in school ad- 
ministration. The foregoing discussion has been concerned primarily with 
the role of measurement in educational research, or in education considered 
as a science. But education is an art as well as a science. It has its practical
as well as its theoretical aspects. It is a primary purpose of this book 
to consider such immediately practical problems as the relationship of 
measurement to the actual administration of schools and the instruction 
of pupils in these schools. Later chapters will consider measurement in
instruction and measurement in school administration. There is some
overlapping between these two major divisions, each of which will be subdivided
still further. But the organization is a convenient one, even if somewhat
arbitrary.


63 Joseph M. Rice, “The Futility of the Spelling Grind,” Forum, 23: 163-172, 409-419,
April and June, 1897. Later studies have found a similar lack of relationship in other
subjects. See Merrill T. Eaton, “A Survey of the Achievement in Social Studies of
10,220 Sixth Grade Pupils in 464 Schools in Indiana,” Bulletin of the School of Education,
Indiana University, 20: No. 3, 1944. 68 pages.

Rice was an important, progressive education pioneer. During the years 1891-1899
he investigated American education relentlessly and published a series of 20 articles in
the Forum which even today sound startlingly modern. The first was “Need School Be
a Blight to Child-Life?” (12: 529-535, December, 1891), the last “Why Teachers Have
No Professional Standing” (27: 452-463, May, 1899). For further historical perspective
see Douglas E. Scates, “Fifty Years of Objective Measurement and Research in
Education,” Journal of Educational Research, 41: 241-264, December, 1947.

64 Thirty-Seventh Yearbook of the National Society for the Study of Education, Part II,
“The Scientific Movement in Education,” page 349. Quoted by permission of the
Society. Bloomington, Illinois: Public School Publishing Company, 1938.

65 Carter V. Good, A. S. Barr, and Douglas E. Scates, op. cit., 882 pages.

It must be admitted at the outset that up to the present time research, 
while not entirely lacking, is by no means sufficient to prove conclusively 
that measurement really serves the above practical functions. However, 
though experimental evidence for the practical value of measurement is 
somewhat meager, objective support for the view that measurement is use- 
less or harmful is virtually nonexistent. The present case for measurement 
in education rests to a large extent upon the testimony of experienced 
teachers and school administrators, and the argumentative ability of
persons enjoying the highest reputation in the field. This is one of the many
points in education where further research is needed. 

Examples of existing experimental evidence as to the value of
measurement in instruction are the following:⁶⁷


Schutte found that normal school students who expected final examinations did 
significantly better than those who did not. Kulp found weekly tests increased 
the amount learned in educational sociology by about 17 per cent. Turney found 
that educational psychology students who took twelve short tests did about 
20 per cent better than others who took only the mid-term and final examinations. 
Jones found that psychology students who took a five-minute test after each lecture 
retained after eight weeks approximately twice as much as those who did not. Keys 
found that the same tests administered in the form of weekly rather than monthly 


examinations in educational psychology gave an immediate superiority of 12 per 
cent. 


Great strides have recently been taken at all school levels—elementary, 
secondary, and college—toward relating tests to educational objectives. 
These are set forth in considerable detail by 20 specialists in a highly 
important book, Educational Measurement.⁶⁸ Trends in the research
literature are summarized frequently by the Review of Educational Research,⁶⁹
whose issues contain rather comprehensive bibliographies. 


66 Three typical articles illustrating this point of view are: Hans C. Gordon, “How
Teachers Use Test Scores to Understand the Needs of Pupils.” In Growing Points in
Educational Research, 1949 Official Report of the American Educational Research
Association, pages 276-278. Washington, D. C.: National Education Association, 1949;
Aileene Lockhart, “Testing Can Improve Teaching,” Journal of Health and Physical
Education, 19: 627-629, November, 1948; Jane E. McAllister, “Tests and Measurements
. . . Relations,” Educational Administration and Supervision, 35: 49-58, January.

67 These studies and other related studies will be discussed more fully in Chapter 11.

68 E. F. Lindquist (Editor), Educational Measurement. Washington, D. C.: American
Council on Education, 1951. 819 pages.

69 Paul Blommers (Chairman), “Methods of Research and Appraisal in Education,”
21: 323-501, December, 1951; J. Raymond Gerberich (Chairman), “Educational and
Psychological Testing,” 20: 1-99, February, 1950; Frederick B. Davis (Chairman),
“Educational and Psychological Testing,” 23: 1-110, February, 1953.




Types of measurement in education. The various types of measure- 
ment employed in education may be classified on different bases and from 
different points of view. The following appears to be a reasonably
satisfactory classification of the instruments of measurement employed in the
ordinary school for distinctly educational purposes: 


A. Oral.
B. Written.
   1. Informal (nonstandardized).
      a. Essay.
      b. Objective.
   2. Formal (standardized).
      a. Achievement.
         (1) General (survey).
         (2) Specific (diagnostic, practice, etc.).
      b. Intelligence.
         (1) General (individual and group).
         (2) Specific (aptitude or prognosis).
      c. Personality and Interests.


The distinction between the major categories, oral and written, is
obvious. The distinction between informal and formal written tests is also
easy to make. A formal test often begins as an informal test, which is later 
subjected to experimental trial and revision, only the best items surviving 
the process. Formal tests also have carefully worded instructions both for 
administering and scoring and, usually, norms for interpreting the results. 

The distinctions among tests of achievement, intelligence, personality 
and interests are not so clear-cut, however. By the term achievement tests is 
meant tests of academic achievement, such as arithmetic or algebra; they 
are distinguished from most personality and interest tests, which are
self-report rather than ability measures.⁷⁰ Intelligence tests, theoretically at
least, are measures of learning capacity, whereas achievement tests are
measures of learning itself.⁷¹ In other words, intelligence tests attempt to
measure educability, while achievement tests attempt to measure educa- 
tion. The writers have followed the usual practice of recognizing tests of 
achievement and intelligence as co-ordinate with tests of personality and 
interests. Strictly speaking, however, achievement and intelligence are 
merely aspects of personality, which is a term used by psychologists to 
include every trait that differentiates one individual from another. In a 
certain sense, then, every test is a test of personality; and many aspects of 
personality cannot be measured by tests at all, but are evaluated by means 


70 Lee J. Cronbach, Essentials of Psychological Testing, pages 14-15. New York:
Harper & Brothers, 1949. 

71 Tilton comes to the interesting conclusion that his data “are fairly consistent with
a concept of a general ability to learn and with the identification of it with the ‘general’
intelligence test.” John W. Tilton, “The Intercorrelations between Measures of School
Learning,” Journal of Psychology, 35: 169-179, January, 1953. Page 179.




of rating scales, questionnaires, interviews, controlled observation, and the 
like. 

Tests are subdivided into general and specific on the basis of scope. They 
may also be further subdivided into individual and group tests on the basis 
of method of administration, and into verbal and nonverbal or performance 
tests on the basis of content. A distinction is often made, although not 
always observed in practice, between a test and a scale. A test consists of 
a series of questions to be answered or exercises of some sort to be done, 
and a pupil's performance is the number of these he is able to do in the 
time allotted. Strictly speaking, a scale consists of a series of specimens,
such as handwriting, for example, arranged in order of merit, and the 
pupil's performance is judged by comparing it with the standard specimens. 
Most standardized instruments of measurement are really tests rather than 
scales. The test items are frequently arranged in order of difficulty, how- 
ever, in which case the term scaled test is sometimes used to distinguish such 
tests from those in which the items are not so arranged.⁷²

Of course, many other types of measurement are employed in schools. 
Examples of these are chronological age, height, weight, temperature, and 
time, but these can hardly be classified as strictly educational. It cannot 
be too strongly emphasized that measurement is not limited to tests and 
examinations, and certainly not to standardized tests. There are also nu- 
merous rating scales and check lists for playgrounds, buildings and equip- 
ment, and so forth, whose use is largely restricted to the specialist. These 
have been omitted in the interest of brevity. 

It must be recognized that recent tendencies in education have enlarged 
its scope and increased its complexity, and have thereby added to the dif- 
ficulties of teaching and administration. But nowhere have these difficulties 
been more apparent than in the problem of measurement. The need for 
proper evaluation is as great in the modern school as ever before, but the 
difficulties of providing for it are vastly greater. For, as Saucier points out, 
“an instrument of measurement may meet all the criteria for measuring a 
reactionary, undemocratic conception of education but at the same time 
be valueless for measuring the major results of a progressive, democratic 
theory of education.”⁷³ This means that as the schools improve, so must
the tools and techniques of measurement and evaluation. 

A quotation from Scates seems to sum up the central theme of this 
chapter quite aptly:⁷⁴


72 For an extensive discussion of modern scale analysis see Samuel A. Stouffer and
others, “Measurement and Prediction,” Volume IV of Studies in Social Psychology in
World War II, 756 pages. Princeton, N. J.: Princeton University Press, 1950.

73 W. A. Saucier, Introduction to Modern Views of Education, pages 368-369. Boston:
Ginn and Company, 1937.

74 Douglas E. Scates, “Fifty Years of Objective Measurement and Research in
Education,” Journal of Educational Research, 41: 241-264, December, 1947. Pages 253-251.




What has the measurement movement done for us? One way of responding to 
this question is to note what we should not have if we had no such measurements. 
Certainly we would not have much of our current scientific work in education. For
there cannot be a science without fairly precise quantification; not that science is 
measurement, but that traits which are devoid of any reasonably definite quality
simply do not have the required specificity for entering into the careful thinking 
essential to science. When quantities are disregarded almost any generalization 
becomes true. There is an infinitesimal element of truth in practically anything 
which might be said. Quantities are part of the nature of truth and are therefore 
an essential part of science.


SELECTED REFERENCES FOR FURTHER READING 


(References cited in the footnotes to Chapter 1 are not repeated here.) 


Barr, Arvil S., Davis, Robert A., and Johnson, Palmer O., Educational Research and 
Appraisal. New York: J. B. Lippincott Company, 1953. 362 pages. 

Boring, Edwin G., A History of Experimental Psychology (Second Edition). New 
York: Appleton-Century-Crofts, Inc., 1950. Chapter I, “The Rise of Modern 
Science," and Chapter XXIII, "Gestalt Psychology." 

Brubacher, John S. (Editor), Eclectic Philosophy of Education. New York:
Prentice-Hall, Inc., 1951. Chapter II, “Science and Philosophy of Education Compared.”

Cohen, I. Bernard, Science, Servant of Man; a Layman’s Primer for the Age of 
Science. Boston: Little, Brown and Company, 1948. Part I, “The Nature of the 
Scientific Enterprise.” 

Davis, Frederick B., “The AAF Qualifying Examination." Army Air Forces Avia- 
tion Psychology Program Research Report No. 6. Washington, D. C.: U. S. Govern- 
ment Printing Office, 1947. 266 pages. 

Dewey, John, Logic, The Theory of Inquiry. New York: Henry Holt & Company,
1938. Chapter XI, "The Function of Propositions of Quantity in Judgment," 
and Part IV, “The Logic of Scientific Method." 

Einstein, Albert, Out of My Later Years. New York: Philosophical Library, 1950. 
282 pages. 

Gulliksen, Harold, Theory of Mental Tests. New York: John Wiley & Sons, Inc., 
1950. 486 pages. 

Jahoda, Marie, Deutsch, Morton, and Cook, Stuart W. (Editors), Research Methods 
in Social Relations, with Especial Reference to Prejudice. New York: The Dryden 
Press, 1951. Part I, “Basic Processes,” and Part II, “Selected Techniques." 


Kemble, Edwin C., “Reality, Measurement, and the State of the System in Quan- 
tum Mechanics,” Philosophy of Science, 18: 273-299, October, 1951. “At the 
heart of modern experimental science is the controlled experiment in which ob- 
jectivity is secured by reducing all measurement to the reading of scales of one 
sort and another. Good experimentation, however, requires more than the re- 
placement of qualitative subjective observations by pointer readings. It requires 
careful analysis of all possible perturbing factors and repeated measurements to 
test the scatter of the results” (page 274). 

Lorge, Irving, "The Fundamental Nature of Measurement,” Chapter XIV in 
E. F. Lindquist (Editor), Educational Measurement. Washington, D. C.: Ameri- 
can Council on Education, 1951. 


Planck, Max, Scientific Autobiography, and Other Papers. New York: Philosophical
Library, 1949. 192 pages.

Ruby, Lionel, Logic, an Introduction. New York: J. B. Lippincott Company, 1950.
Part III, “The Logic of Truth: Scientific Methodology.”

Russell, Bertrand, A History of Western Philosophy. New York: Simon & Schuster,
1945. 895 pages.

Sarton, George, The Life of Science; Essays in the History of Civilization. New York:
Henry Schuman, 1948. Part I, “The Spread of Understanding.”

Stevens, S. Smith, “Mathematics, Measurement, and Psychophysics,” Chapter I
in S. Smith Stevens (Editor), Handbook of Experimental Psychology. New York:
John Wiley & Sons, 1951.

Whitney, Frederick L., The Elements of Research (Third Edition). New York:
Prentice-Hall, Inc., 1950. Chapter I, “Reflective Thinking, Science, and Research.”

Williams, Donald, The Ground of Induction. Cambridge, Mass.: Harvard University
Press, 1947. Chapter V, “The Logic of Science.”


2 


The Historical Development of 
Measurement in Education 


A. Introduction 


Tests and measurements of one kind or another have played a far more 
prominent role in human history than is generally recognized. Nor has 
their use by any means been confined to the schools. In fact, among the 
earliest records of the use of various testing devices are those found in the 
Bible, although they generally have no direct reference to education. One 
illustration¹ will suffice: 


And the Gileadites took the passages of Jordan before the Ephraimites: and it 
was so, that when those Ephraimites which were escaped said, Let me go over; 
that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay; 
then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could 
not frame to pronounce it right. Then they took him, and slew him at the passages 
of Jordan: and there fell at that time of the Ephraimites forty and two thousand. 


Attention is called to the fact that here is indeed a "final examination" 
and in a field other than education. Doubtless measurement experts of the 
present time would point out that, in spite of a rather high degree of objec- 
tivity, there were certain dubious features: it was oral, it was very short, 
and the mortality rate was excessively high! 

A sociologist² attributes the remarkable stability of the Chinese civiliza-
tion, the oldest culture of any modern nation, to five factors, one of which
is her highly organized examination system. It began informally in 225 B.C.,
and became a definite civil service examination system in 29 B.C. The sys-
tem, described as being thoroughly democratic, ruthless, invariable, and
orthodox, has had profound effects, some good and some bad, not only upon
the educational system of China, but also upon her whole civilization.
On the one hand, it has preserved unity by keeping uniform throughout
the empire the written language, literature, and traditions of the Chinese
nation, and has helped to maintain political stability by keeping open to
every citizen the door to prestige and power. On the other hand, it has often
produced more graduates than could be given positions, has offered little
assurance that the successful candidates possessed the qualities necessary
for good officials, has sometimes resulted in corruption in the conduct of the
examinations, and has in some degree stifled progress.

¹ Judges, 12: 5-6 (King James Version).

² Paul F. Cressey, “The Influence of the Literary Examination System on the Devel-
opment of Chinese Civilization,” American Journal of Sociology, 35: 250-262, Septem-
ber, 1929.

Some kind of measurement or evaluation seems inevitable in education. 
It seems inherently an essential part of the teaching process.³ The situation
has been well expressed as follows:⁴


As far back as we have any record of school routines, teachers have always 
examined or tested, as well as taught. But our attitudes towards these two functions 
have been, historically, quite different. We have long understood that teaching is 
a highly skilled business, a profession, calling for special aptitudes and extensive 
preparation, and that both its techniques and its objectives are worthy of the most 
careful investigation. But examining or testing we have taken for granted as some- 
thing that anybody could do any time, quite casually, for any purpose he might 
happen to think of. It is only yesterday that it occurred to most of us that there 
might be skilled techniques of testing or that the uses we were accustomed to make 
of our tests and examinations might be open to question. 

Every teacher or administrator of more than twenty years’ service will recall 
with me that Age of Innocence when a “test” regularly consisted of ten questions, 
sometimes concocted impromptu as we wrote them on the blackboard, each 
weighted, by our arbitrary personal fiat, with a value of 10 on a scale of 100, and 
when the perfectly simple purpose of any “test” was to “pass” or “flunk” the 
testees. We knew no qualms in those days about reliabilities or validities or com- 
parability, and the sigma lay as far in the future (for us teachers) as television. 

It was only after the World War that this primal innocence was disturbed by 
the coming—into the consciousness of teachers generally, as distinguished from the 
psychologists—of what many of us still think of as “the new tests.” A bewildering 
series of strange inventions: intelligence tests first, and then objective achievement 
tests, and aptitude tests, and interest tests, and personality inventories and ratings. 
Nearly all of them appallingly elaborate; and alleged to have been most laboriously
prepared, with every item studied and checked, and to have been tried out on
hundreds or thousands of students, and then re-studied and re-checked by mysteri-
ous statistical methods.

³ In this connection the experience of Russia is instructive. In 1917 the Soviet Govern-
ment, wishing to achieve as complete a transformation of education as of government,
did away with all forms of examinations and school marks. After fifteen years’ experi-
ence, however, the Central Committee of the Communist Party declared the plan ineffec-
tive and undesirable, and recommended the reintroduction of a rigid system of examina-
tions and marks. See Ernest Martin Hopkins, “Prerequisites of Intelligence,” School
and Society, 40: 473-480, October 13, 1934; “Changes in the Soviet Educational System,”
School and Society, 42: 836-837, December 14, 1935; and “Testing Russian Students,”
School and Society, 50: 25-26, July 1, 1939.

⁴ Max McConn, “Examinations Old and New: Their Uses and Abuses,” Educational
Record, 16: 375, October, 1935.


The foregoing picturesque statement is accurate enough for “teachers 
generally,” as the author intended, but is far from true of certain outstand- 
ing leaders in the profession. Horace Mann, for example, almost one hun- 
dred years ago, had a remarkable conception both of the importance of 
examinations and of the limitations of the forms then in existence. His 
penetrating analysis of the weaknesses of the oral examinations then in 
vogue, and of the superiority of written examinations, could hardly be 
improved upon by the modern specialist in measurement. Mann showed 
clearly the points where the oral examinations were lacking, in the technical 
language of today, in validity, reliability, and usability.⁶

Another American educator who understood both the value and the 
limitations of examinations was Emerson E. White, widely known as a 
writer and school administrator. In 1886 he wrote: “It may be stated as a 
general fact that school instruction and study are never much wider or bet-
ter than the tests by which they are measured.”⁷ In the same volume the
author enumerates several “special advantages” of the written test:⁸


It is more impartial than the oral test, since it gives all the pupils the same tests 
and an equal opportunity to meet them; its results are more tangible and reliable; 
it discloses more accurately the comparative progress of the different pupils, in- 
formation of value to the teacher; it reveals more clearly defects in teaching and 
study, and thus assists in their correction; it emphasizes more distinctly the im- 
portance of accuracy and fullness in the expression of knowledge; it reveals more 
fully than the ordinary language exercise the ability of the pupil to write correctly 
when his attention is directed to the thought or subject-matter; it is at least an 
equal test of the thought-power or intelligence of pupils, since this result, in both 
methods, is dependent upon the nature of the tests; and, lastly, the certainty of 
the coming written test affords a healthy stimulus to pupils, increasing their atten- 
tion to instruction, and their efforts to master the subjects taught. 


These views of Mann and White appear surprisingly modern and show 
how far the practice of the rank and file is likely to fall behind the theory 
of the pioneer thinker. It is doubtful if any single sentence in recent educa- 
tional literature states the superiority of the written over the oral examina- 
tion more completely or more forcefully than the one just quoted from
White. In fact, many modern specialists in measurement would probably
accept the above indictment of oral tests in toto. They would, of course,
wish to discount somewhat the values so enthusiastically proclaimed for
ordinary written examinations, and would point out that many of the limi-
tations of the oral tests so forcefully stated also hold in some degree for the
written tests, and in addition that the latter have some special limitations
of their own not then recognized. But that is another story to be told later.

⁵ Otis W. Caldwell and Stuart A. Courtis, Then and Now in Education: 1845-1923,
pages 37-41. Yonkers: World Book Company, 1923.

⁶ These terms will be explained in Chapter 4.

⁷ Emerson E. White, The Elements of Pedagogy, page 148. New York: American Book
Company, 1886.

⁸ Ibid., pages 197-198. However, in an unusual article dealing with a semi-objective
use of oral questioning in the classroom Max M. Kostick and Belle M. Nixon present
another side of the argument: “How to Improve Oral Questioning,” Peabody Journal of
Education, 30: 209-217, January, 1953.


B. The History of Intelligence Tests 


In Jevons' The Principles of Science, published in 1874, occurred this
significant statement:⁹


As physical science advances, it becomes more and more accurately quantitative. 
Questions of simple logical fact after a time resolve themselves into questions of 
degree, time, distance, or weight. Forces hardly suspected to exist by one genera-
tion, are clearly recognised by the next, and precisely measured by the third
generation. But one condition of this rapid advance is the invention of suitable 
instruments of measurement. . . . Accordingly the introduction of a new instru- 
ment often forms an epoch in the history of science. 


While the foregoing statement was intended as a history of the past 
development of physical science, it is also a remarkably accurate prophecy 
of the future development of measurement in psychology, which Jevons 
appears to have foreseen, as indicated by his reference to the "fact that 
man in his economical, sanitary, intellectual, aesthetic, or moral relations 
may become the subject of exact sciences, the highest and most useful of 
all sciences.”¹⁰ This statement is all the more remarkable when one con-
siders that it was made five years before Wundt established the first psy-
chological laboratory and Galton began publishing his most important 
studies of individual differences, while both Binet and Cattell were lads in 
their teens, and before either Thorndike or Terman had been born. But it 
was over a quarter of a century before any very definite progress was made 
toward fulfilling the prophecy. Then there followed rapid progress in that 
direction along several lines. That story will now briefly be told. 

Germany and experimental psychology. An important event in the 
history of psychology was the establishment of the first experimental labo- 
ratory in psychology by Wilhelm Wundt at Leipzig in 1879. He was, how- 
ever, primarily interested in the analysis of consciousness into elements 
in a manner analogous to that employed in atomic chemistry. His sole 
interest in measurement appeared to be confined to reaction times, and 
he was distinctly unsympathetic to the problem of individual differences; 
but he did influence considerably the course of psychology, especially the 
work of other German psychologists, such as Kraepelin, Ebbinghaus, and
Meumann, who introduced many forms of separate tests, which were bor-
rowed by later investigators in constructing their scales for measuring
general intelligence. Of the test forms, the completion test of Ebbinghaus
was doubtless the most important.

⁹ W. Stanley Jevons, The Principles of Science, Book III, page 313. New York: The
Macmillan Company, 1874.

¹⁰ Ibid., page 386.

Another important idea, suggested in 1912 by Stern, was that of repre- 
senting intelligence as the ratio of mental age to chronological age. This 
concept, for which Stern suggested the term “mental quotient," was later 
adopted by Terman as the familiar IQ. 
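
The ratio just described can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the convention, used with the Stanford Revision, of multiplying the quotient by 100, and the function names are hypothetical rather than taken from any test manual.

```python
def mental_quotient(mental_age: float, chronological_age: float) -> float:
    """Stern's ratio of mental age to chronological age.

    Both ages must be expressed in the same unit (years or months).
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age / chronological_age


def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: the mental quotient scaled by 100 (assumed convention)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    # Multiply before dividing so that whole-number ages give exact results.
    return 100.0 * mental_age / chronological_age
```

Under these assumptions a ten-year-old with a mental age of twelve has a ratio IQ of 120, and one with a mental age of eight has a ratio IQ of 80; a child whose mental and chronological ages agree scores 100.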

England and statistical methods. The distinctive contribution of 
the English to the measurement of intelligence has been that of statistical 
methods as a tool for the analysis of test results. Sir Francis Galton, one of 
the most brilliant and versatile men of the nineteenth century, was the first 
to treat seriously the problem of individual differences in psychology, par- 
ticularly in the realm of sensory discrimination, although Weber, Fechner, 
Helmholtz, and others had given slight attention to it in what is often 
termed psychophysics. In 1883 Galton outlined a method for studying free 
association by quantitative methods. But his most notable contribution 
was in statistical analysis, where he suggested among other things a graph-
ical method of representing correlations.¹¹ Karl Pearson, a pupil of Galton,
and Charles E. Spearman still further advanced the science of statistics. 
Spearman developed his well-known two-factor theory of intelligence on 
the basis of statistical analysis. Cyril Burt, who has been a leader in intro- 
ducing and adopting Binet’s work in Great Britain, was in 1913 officially 
appointed school psychologist, possibly the first person in the world to 
occupy that position. 

France and abnormal psychology. The French have long been leaders 
in abnormal psychology. Consequently, they approached the problem of 
measuring intelligence from the standpoint of the classification and treat- 
ment of the mentally defective. This brings us to the most important name 
in the history of intelligence testing, Alfred Binet. 

It would be hard to find a man who better illustrates Jevons’ description 
of the methods of the genius than Binet. Jevons says:¹²


It would be a complete error to suppose that the great discoverer is one who 
seizes at once unerringly upon the truth, or has any special method of divining it. 
In all probability the errors of the great mind far exceed in number those of the 
less vigorous one. Fertility of imagination and abundance of guesses at truth are 
among the first requisites of discovery. 


This reads as if it were designed specifically to describe Binet, and yet it 
appeared twenty years before Binet founded L’Année Psychologique and 
thirty years before his first scale for measuring intelligence. He first studied 
law, then medicine, and afterward worked in a biological laboratory. Later
he turned psychologist, first of the arm-chair variety, and finally ended as
an experimentalist. Furthermore, in an effort to devise a suitable method
of measuring intelligence he tried out various head measurements, physi-
ognomy, graphology, and palmistry, before hitting upon the correct ap-
proach. Binet never seemed to be quite sure what he meant by “intelli-
gence,” what he was trying to measure, for he changed his definitions
repeatedly. It is clear, therefore, that he did not hit “at once unerringly
upon the truth,” and that he did possess to a marked degree “fertility of
imagination and abundance of guesses.” Such errors as he made, and they
were numerous, were not the unintelligent ones of blind trial and error,
but rather the intelligent errors of judgment, made by acting upon the
course which seemed most promising from a survey of the best available
facts at hand.

¹¹ David G. Ryans, “Francis Galton’s Statistical Contributions,” School and Society,
48: 312-316, September 3, 1938.

¹² W. Stanley Jevons, op. cit., Book IV, page 221.

It is doubtless true, as Boring suggests:¹³


At close view, the course of science seems discontinuous; all at once a "genius" 
makes a discovery or formulates a theory, and productive research follows on im- 
mediately. At the greater range of historical perspective, the course of science 
seems to be continuous, and the “genius” appears as an opportunist who takes 
advantage of the preparation of the times. 


Opinions will differ regarding the appropriateness of the word “opportun-
ist” in Binet’s case, but there can be no doubt that he did take “advantage
of the preparation of the times." Both for his ideas and actual test ma- 
terials he drew freely from others, notably his fellow countryman, Blin, and 
his contemporaries in Germany. Nevertheless, Binet did something the 
others had not done; he began where they left off and continued with a 
definite contribution both to the theory and practice of testing. On the 
theory side he enlarged the prevailing concept of intelligence, introducing 
such ideas as those of judgment, adaptation, and self-criticism. Terman¹⁴
argues that Binet’s outstanding contribution to psychometrics was his 
abandonment of any attempt to measure “intellectual faculties as such.” 
To practice he contributed a technique of scale construction and a finished 
scale consisting of test situations selected according to predetermined cri- 
teria and standardized. The date 1905 is important, therefore, because it 
marked the appearance of the first scale for the measurement of intelli- 
gence, which, crude as it was, has served as the pattern for subsequent tests 
and scales the world over. The 1908 revision was a definite improvement, 
and is especially notable for the introduction of the mental age concept.
Further experimental work resulted in the scale of 1911, the year of Binet’s
death.

¹³ Edwin G. Boring, A History of Experimental Psychology, page 452. New York:
The Century Company, 1929. Used by permission of Appleton-Century-Crofts, Inc.

¹⁴ Lewis M. Terman, “The Revision Procedures,” page 6. Chapter I in Quinn
McNemar, The Revision of the Stanford-Binet Scale. Boston: Houghton Mifflin Com-
pany, 1942.


America and applied psychology. The scene now shifts to America, 
where the outstanding name is J. McKeen Cattell, who was a pioneer along 
many lines and a promoter of the first rank. More than anyone else, Cattell 
was responsible for giving to American psychology its practical bent, for 
with him the practical took precedence over the philosophical. As early 
as 1885 he began to publish important articles on reaction times and in-
dividual differences. It was Cattell¹⁵ who in 1890 suggested the term “men-
tal tests,” which was to become a sort of trade-mark for the whole measure-
ment movement. But Cattell was too close to Wundt’s laboratory to escape 
altogether the views of its master. Cattell, therefore, just as did Galton, 
confined his tests largely to the simpler mental processes, such as sensory 
discrimination, where individual differences are least, rather than to the 
higher mental processes, where they are greatest. In other words, both 
Galton and Cattell attempted to measure intelligence, but with the wrong 
tools. Very little attention was given either to reliability or to validity. 
Consequently, in 1901, when Wissler¹⁶ published his analysis of Cattell's 
tests used with college students, in which he applied for the first time 
the Pearson correlation technique to test scores, and in which he found 
little more than chance relationship either among the tests themselves or 
between the tests and college work, a considerable damper was thrown over 
the enthusiasm of American testers which was not lifted till after Binet had 
published his 1905 scale. Nevertheless Cattell’s influence upon measure- 
ment, through both his writing and his students, notably Thorndike, has 
been great. 

Goddard was probably the first American psychologist to recognize the 
practical value of Binet’s 1908 and 1911 scales, which he translated and 
with minor adaptations tried out at Vineland.¹⁷ In 1911 and 1912 Kuhl-
mann published his revisions of the 1911 Binet scale, extending it down-
ward to the age of three months, instead of three years, which was Binet’s 
lower limit. 

It remained for Terman of Stanford University to provide the first 
thoroughgoing revision, carefully adapted to and standardized for use with 
American children, normal as well as subnormal. Terman’s scale, known 
as the Stanford Revision or Stanford-Binet, appeared in 1916, together
with a most complete manual, The Measurement of Intelligence.¹⁹ This revi-
sion has been criticized on the ground that it was standardized entirely on
school children, which may result in somewhat of a handicap for those of
poor academic background, and that it did not produce a sufficient “scat-
ter” in the distribution of IQ's, particularly at the higher ages. It has also
been criticized on the ground that its norms were based exclusively on the
children of one state, California, which may not be truly representative
of the United States as a whole. Nevertheless, the Stanford Revision was,
for more than two decades, the most widely used and most highly regarded
individual intelligence test in existence. In 1937 a thorough revision of the
Stanford-Binet appeared.²⁰ This second revision corrected most of the weak-
nesses of the first revision.

¹⁵ J. McK. Cattell, “Mental Tests and Measurements,” Mind, 15: 373-380, 1890.

¹⁶ Clark Wissler, “The Correlation of Mental and Physical Tests,” Psychological Re-
view, Monograph Supplement, Vol. VIII, No. 16, 1901.

¹⁷ Henry H. Goddard, “Four Hundred Feebleminded Children Classified by the Binet
Method,” Pedagogical Seminary, 17: 387-397, September, 1910; “A Measuring Scale
for Intelligence,” The Training School, 6: 146-155, January, 1910; “Two Thousand
Normal Children Measured by the Binet Measuring Scale of Intelligence,” Pedagogical
Seminary, 18: 232-259, June, 1911.

¹⁸ For an excellent historical discussion, see Florence L. Goodenough, Mental Testing:
Its History, Principles, and Applications, Chapter IV, “The Early Tests (1887-1915),”
and Chapter V, “Later Developments.” New York: Rinehart & Company, Inc., 1949.

Two other distinctly American developments, both aiming to make in- 
telligence tests more practical, remain to be discussed. These early tests 
had two practical disadvantages which militated against their wide use. 
One of these was that the tests were highly verbal; that is, their successful 
administration required that the subject taking the test understand the 
English language. The other was that the tests were individual; that is, 
only one person could be examined at a time. Reasonably satisfactory 
solutions came to both of these problems in the year 1917; and this leads 
to an interesting story. 

Intelligence tests, children of necessity. One cannot but be im- 
pressed with the curious role of necessity in the development of intelligence 
testing both in Europe and in America. Although it may be true, as Thorn- 
dike suggests, that necessity is not the true mother of invention, she is, 
often at least, the stern, relentless stepmother. Two instances in Europe 
and two in America will suffice to make this clear. 

In 1897 Ebbinghaus was appointed on a commission to investigate the 
problem of fatigue in the schools of Breslau. As there were in existence no 
appropriate tests, Ebbinghaus set about to devise them; the first “com-
pletion tests” resulted from this endeavor. Seven years later the Minister
of Public Instruction in Paris became concerned about the high percentage 
of failure in the Paris schools and appointed Binet on a commission to 
determine those who were so mentally unfit as to necessitate instruction 
in special classes. Binet, too, found available measuring instruments in-
adequate for the purpose.²¹ Out of this difficulty emerged the 1905 scale
already referred to, the first successful instrument for measuring intelli-
gence according to modern conceptions. But the original Binet scales and
their early revisions, both in Europe and America, possessed the two limi-
tations mentioned above: namely, they were highly linguistic and they
were individual scales. Soon American ingenuity was to offer solutions to
both problems.

¹⁹ Lewis M. Terman, The Measurement of Intelligence, 362 pages. Boston: Houghton
Mifflin Company, 1916.

²⁰ Lewis M. Terman and Maud A. Merrill, Measuring Intelligence, 461 pages. Boston:
Houghton Mifflin Company, 1937.

²¹ Binet confessed himself unable to distinguish an idiot, who was described by exist-
ing standards as having a “gleam” of intelligence and an attention which was “fugitive,”
from an imbecile, who was described by these standards as having a “very incomplete
degree” of intelligence and an attention which was “fleeting.” Verily, here indeed was a
distinction without a difference.


[Facsimile page of analogies items omitted. The directions ask the examinee to see
what the relation is between the first two words in each line and to underline the
word in heavy type that is related in the same way to the third word; a sample item
is “mayor—city :: general— private navy army soldier.”]


Courtesy National Academy of Sciences. 
Figure 1. Test 7 from the Army Alpha. 


Pintner and Paterson, finding the Stanford-Binet unsatisfactory for deaf 
children, met the first difficulty by assembling a series of fifteen tests of 
manipulation or performance, such as the form board already used by 
Seguin, Healy, and others. This combination, which appeared in 1917, was 
known as the Pintner-Paterson Performance Scale. That same year the 
United States found herself in World War I faced with the urgent necessity 
of training a large citizen army with an insufficient supply of commissioned
and noncommissioned officers. In this emergency the American Psycholog-
ical Association placed its services at the disposal of the War Department.
The existing individual intelligence tests were not only entirely unsuited
for use with illiterate and foreign-speaking recruits, but they were also
much too slow. To meet this need, a committee of psychologists, utilizing
largely the as yet unpublished work of Otis, prepared the Army Alpha,²²
the first of a long succession of group tests destined to receive wide use.
The second difficulty of the early tests had now been solved, for a group
test can be administered to a hundred or more in the time formerly required
for measuring one.

Courtesy National Academy of Sciences.
Figure 2. Test 6 from the Army Beta.

²² Although the Army Alpha had antecedents developed over the preceding 30 years,
none of these earlier tests can be said to have passed beyond the experimental stage.


It should be noted, however, that the Pintner-Paterson Performance 
Scale and the Army Alpha group tests each solved but one difficulty at a 
time. In fact, as a rule, group tests of the Army Alpha type are even more 
verbal than the individual tests had been. Figure 1 shows a sample page 
from the Army Alpha. The early performance scales, on the other hand, 
were nonverbal, but could be administered to only one person at a time. 
The Army Beta, designed for illiterate and foreign-speaking soldiers, was 
the first test to combine the group and performance ideas. It also appeared 
in 1917. Figure 2 shows the picture completion test of the Army Beta. 
Since that time several group tests, largely or primarily of the performance 
type, have been designed specifically for use with young children just enter- 
ing school. There can be little doubt that World War I gave a decided 
impetus to the measurement movement in America. World War II had a
similar and perhaps even more marked effect.²³

Tests of specific aptitude. All of the tests so far described have been 
for the measurement of general intelligence. There has also been some 
activity in the development of tests of specific intelligence, or capacity in a 
restricted area, such as music or mechanics, or in a specific school subject, 
such as algebra or Latin. America has also had the lead in the development 
of these tests, often called aptitude or prognosis tests. One of the earliest 
and in many ways one of the best known of these tests is the Seashore Test 
of Musical Talent, which appeared in 1915. Three years later appeared the 
Stenquist Test of General Mechanical Ability. In 1918, also, Rogers pub- 
lished a test of mathematical ability, which, although hardly an aptitude 
test in the modern sense, introduced the idea which other authors have 
followed up by aptitude tests in the special branches of mathematics, such 
as algebra and geometry. A somewhat different type of test on the college 
level is illustrated by the Iowa Placement Examinations which appeared 
in 1924. A recent and promising type of test, of which there are several 
examples, is that of reading readiness, to be used to determine a child's 
fitness for the work of the first grade. There is evidence that the develop-
ment of the future is likely to be along the line of tests for specific aptitude,
rather than tests of general intelligence, which aim to cover the whole range 
of human capacity at one shot. The test maker, as well as the bird hunter, 
may aim at too large a target. Dunlap argues this side of the case well:²⁴
“The more ‘general’ the intelligence test, the less its value. By increasing 
the specificity . . . we add to its value. Charles Dudley Warner once shot 
a bear by ‘aiming at it generally,’ but it is a poor method.” Thurstone’s 
attempt to devise tests of what he terms “primary mental abilities” is a 

23 For a discussion of the Army General Classification Test, see: Staff, Personnel Re-
search Section, “The Army General Classification Test,” Psychological Bulletin 42:
760-768, December, 1945. A civilian edition of the AGCT was published by Science
Research Associates in 1947.


24 Knight Dunlap, Habits, Their Making and Unmaking, page 266. New York:
Liveright Publishing Corporation, 1932.




move in this direction, although these tests have not yet demonstrated 
marked superiority over other tests in practical schoolroom situations.²⁵


C. The History of Achievement Tests 


Progress before 1918. The early history of things which have been in 
existence a long time is usually somewhat obscure. This is true of achieve- 
ment tests, whose ancient use has already been referred to. Not only have 
some kinds of tests been in existence for centuries, but attention has been 
called to the fact that criticisms of them, both destructive and constructive, 
are by no means new. 

But the actual work of improving the existing instruments has always 
lagged far behind the theory, and actual school practice has been furthest 
behind of all. In spite of the marked superiority of written examinations 
over oral, pointed out by Horace Mann in 1845, educators did not forth-
with adopt the former or improve the latter.²⁶ However, as early as 1864
an English schoolmaster, the Reverend George Fisher,²⁷ evidently realizing
the subjectivity of ordinary examinations, proposed a “Scale-Book,” made
up of “various standard specimens . . . arranged in order of merit.”²⁸
But Ayres observes that “Mr. Fisher's efforts seem to have produced no
lasting results,” for which he suggested this explanation:²⁹


Progress in the scientific study of education was not possible until people could 
be brought to realize that human behavior was susceptible of quantitative study, 
and until they had statistical methods with which to carry on their investigations. 


Although Ayres felt that Galton's work had largely met these two needs, 
he gave Dr. J. M. Rice the honor of being the “real inventor of the com-
parative test” in America in 1894.³⁰ Rice had studied in Germany and had
come under the influence of experimental psychologists both at Jena and 
Leipzig. Here again the attitude of the educational leaders was anything 
but cordial, and “for more than ten years but little progress was made 
beyond the work of the pioneer himself.”³¹


25 Duane C. Shaw, “A Study of the Relationships between Thurstone Primary Mental
Abilities and High School Achievement,” Journal of Educational Psychology, 40: 239-249,
April, 1949.

26 As Caldwell and Courtis observe: “Very few schoolmen proved to have the intelli-
gence of Horace Mann, and the era foreseen by him did not begin to materialize until
more than fifty years later." Op. cit., page 8. 

27 For a good discussion of the early history of achievement tests, see: Leonard P.
Ayres, “History and Present Status of Educational Measurements,” Seventeenth Year-
book of the National Society for the Study of Education, Part II, pages 9-15. Bloomington,
Illinois: Public School Publishing Company, 1918.

28 Quoted by Edward L. Thorndike and Isaac L. Kandel in “Educational Measure-
ment of Fifty Years Ago,” Journal of Educational Psychology, 4: 551-552, November,
1913, from E. Chadwick's article in Museum, A Quarterly Magazine of Education, Litera-
ture and Science, 3: 480-484, 1864, where he quotes the Reverend Fisher.

29 Leonard P. Ayres, op. cit., page 10.

30 Leonard P. Ayres, ibid., page 11.

31 Leonard P. Ayres, ibid., page 12. 




Ayres makes a distinction between the “inventor” of educational meas- 
urement and the "father" of the movement. The latter distinction he 
awards to Dr. Edward L. Thorndike. The honor is richly merited, for no 
other person has touched the measurement movement at so many points 
or has contributed so much to it. In addition to his very influential publica- 
tions on statistical methods in education and his pioneer work on intelli- 
gence tests for college entrance, either Thorndike or his students were 
responsible for most of the early standard tests and scales for measuring 
achievement. The first test was the Stone Arithmetic Test published in 1908, 
and the first scale was the Thorndike Handwriting Scale announced in 1909 
and published the following year. The next few years saw the appearance 
of scales and tests in various fields. The school survey movement un- 
doubtedly added impetus to the measurement movement, as did the 
appearance of certain important books and periodicals to be referred to
later. 

Studies in the unreliability of school marks and examinations. 
But there was an additional factor which served as a very strong stimulus 
to standard tests: Educators discovered for the first time just how bad existing 
measurements were. Beginning about 1910, several studies in rapid succes- 
sion made this point convincingly clear. A distinction should be made be- 
tween the limitations of school marks and the limitations of school exami- 
nations. The need for reform in college marking was forcibly brought to 
public attention by Max Meyer,³² who reported on the marks collected
from forty instructors for a period of five years at the University of Mis- 
souri. He found such astonishing variations as 55 per cent of A’s in philos- 
ophy and only 1 per cent in Chemistry III, while there were 28 per cent of 
failures in English IE and none in Latin I. Johnson³³ found a similar con-
dition in the University of Chicago High School. In a two-year period he
found, for example, that the marks for German showed 17.1 per cent A’s 
and 8.4 per cent F’s, whereas the marks in English showed 6.5 per cent A’s 
and 15.5 per cent F’s. Such variations both at Chicago and Missouri could 
be most reasonably interpreted on the supposition, not that English is 
harder than foreign languages, but that English instructors are harder. 
In other words, school marks are highly subjective, the mark received often 
being more a function of the personality of the instructor than of the per- 
formance of the student. Further studies showed similar results elsewhere 
without exception. This was certainly disturbing, if not, as Thorndike sug-
gests, actually “scandalous.”³⁴

But the evidence presented by a second type of study was even more 
damaging. Variations among the final marks in different departments 

32 Max Meyer, “The Grading of Students,” Science, 28: 243-250, August 21, 1908.

33 Johnson, “A Study of High-School Grades,” School Review, 19: 13-24.

34 Twenty-First Yearbook of the National Society for the Study of Education, Part I,
page 2.




might be accounted for, at least in part, by variations in the background, 
intelligence, and application of the students in these departments. This, 
at any rate, provided a comfortable loophole. But even this avenue of 
escape was soon to be closed. Manifestly, such factors could not be respon-
sible for differences when several persons were marking the same student’s
paper, and least of all when the same person marked the same paper on two 
different occasions. But studies in abundance have established both con- 
ditions. 

Perhaps the most striking of the early studies were those of Starch and 
Elliott. In one of these studies Starch and Elliott³⁵ used facsimiles of the
same geometry paper which were marked by 116 high-school teachers of 
mathematics. The values assigned ranged from 28 to 92. Manifestly, if 
high-school teachers cannot agree any more closely than that in mathe- 
matics, one of the most objective subjects, the situation is indeed bad. 

Other studies tended but to confirm the suspicions. One of the most 
spectacular was made by Falls,³⁶ who had 100 English teachers mark a
composition by assigning it a percentage value and also indicating the 
school grade in which they would expect that quality of work to be done. 


TABLE 1 


THE ESTIMATED GRADE-VALUE AND PERCENTAGE MARKS ASSIGNED TO AN ENGLISH
COMPOSITION BY ONE HUNDRED TEACHERS (AFTER FALLS)


Grade                          Percentage Mark
Value | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | Total
XV 2 2
XIV 0
XIII 1 2 3
XII 1 2 3 6
XI 2 6 5 2 15
X 1 3 8 4 7 1 24
IX 1 1 1 8 4 4 3 22
VIII 2 2 2 3 4 3 16
VII 2 2 2 1 7
VI 1 1 1 1 4
V 1 1
Total 3 6 8 22 20 24 17 | 100

As will be noted from Table 1, the percentage values varied from 60 to 98, 
and the estimated grade location, from the fifth grade to the junior year of 
college. As a matter of fact, the composition was the best one found by a


35 Daniel Starch and Edward C. Elliott, “Reliability of Grading Work in Mathe-
matics,” School Review, 21: 254-259, April, 1913.

36 Falls, “Research in Secondary Education,” Kentucky School Journal, 6: 42-46,
March.




survey committee at Gary, Indiana, a few years earlier, and was written 
by a high-school senior whose special interest was journalism and who was 
a correspondent for some of the Chicago newspapers. It is not unreasonable 
to suppose that many of these English teachers will never have as good a 
composition submitted by one of their pupils or that few of these teachers 
could have written a better one themselves. 

Evidence is available that examiners in other fields show variations fully 
as startling as those reported in public education. Ten examination papers
written by applicants for licenses to practice dentistry in Kentucky were 
submitted for regrading to the regular examiners on the official boards of 
23 other states. The results are summarized in Table 2. The papers are 
arranged in rank order from low to high (1 being highest) according to the 
median judgment of the 24 examiners, who are designated by the letters 
A to X according to degree of strictness in marking. That the variations 
are enormous is indicated by several facts. With a minimum passing mark 
of 75, it will be noted that every paper was passed by at least four exam- 
iners, and failed by at least four other examiners. The most liberal ex- 
aminer, A, passed them all, while the two strictest, W and X, failed them 
all! Seven different papers were rated by one or more examiners as the best 
of the ten, while two of these seven papers were rated by other examiners 
as the poorest of the ten. Surely such a situation can hardly be regarded as 
anything but chaotic.³⁷

But Starch³⁸ also presented the problem in a different and still more
unfavorable light. He found that college instructors assigned different 
marks when they regraded their own papers without knowledge of their 
former marks. In a later study Ashbaugh³⁹ had 49 Ohio State University
seniors and graduate students, the latter with teaching experience, rate a 
seventh-grade arithmetic paper on a percentage basis three times, at in- 
tervals of four weeks between ratings. Some idea of the lack of consistency 
in scoring can be gained when it is mentioned that only one student gave 
the same total score on all three trials and only seven gave the same total 
score on any two successive trials. The mean differences between pairs of 
scores on successive trials were as follows: between the first and second 
trials, 8.1 points; between the second and third trials, 7.3 points. In studies 
by the writer using the same arithmetic paper, he has found variations of 
as much as 27 points on successive trials by the same scorer and as much 
as 10 points variation on values assigned to the first problem on two suc- 
cessive trials approximately ninety days apart. 
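
The kind of consistency check reported by Ashbaugh can be sketched in a few lines of Python. The marks below are invented for illustration and are not Ashbaugh's data; the function names are hypothetical, and the sketch assumes that a "mean difference" is the mean of the absolute differences between each scorer's marks on two successive trials.

```python
# Hypothetical percentage marks given by five scorers to the same paper on
# three occasions. These figures are invented for illustration only.
trials = {
    "first":  [72, 85, 64, 90, 78],
    "second": [80, 79, 70, 90, 66],
    "third":  [75, 88, 70, 84, 71],
}


def mean_abs_difference(a, b):
    """Mean of the absolute differences between paired marks."""
    if len(a) != len(b):
        raise ValueError("both trials must cover the same scorers")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def count_unchanged(a, b):
    """Number of scorers who assigned the same mark on both trials."""
    return sum(1 for x, y in zip(a, b) if x == y)


first_second = mean_abs_difference(trials["first"], trials["second"])
second_third = mean_abs_difference(trials["second"], trials["third"])
same_first_second = count_unchanged(trials["first"], trials["second"])
same_second_third = count_unchanged(trials["second"], trials["third"])

print(f"first vs. second trial: mean difference {first_second:.1f} points")
print(f"second vs. third trial: mean difference {second_third:.1f} points")
```

With these invented marks the mean difference is 6.4 points between the first and second trials and 5.0 points between the second and third, and only one scorer repeats a mark on each pair of trials.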


37 For a fuller account of this investigation, see: Leon M. Childers, “Report of the 
Research Committee on Examinations,” Proceedings of the Sixtieth Annual Meeting, 
National Association of Dental Examiners, 60: 77-106, August, 1942. 

38 Daniel Starch, “Reliability and Distribution of Grades,” Science, 38: 630-636,
October 31, 1913.

39 E. J. Ashbaugh, “Reducing the Variability in Teachers’ Marks,” Journal of Edu-
cational Research, 9: 185-198, March, 1924.




TABLE 2

[Table not recoverable. Surviving labels: Percentage Values Assigned; Rank Order of the Ten Papers (10 Is Lowest).]



In a similar study, Hulten⁴⁰ found that 28 experienced Wisconsin high-
school English teachers differed widely on trials at an interval of two
months in the values assigned an English composition, which they thought 
was written by an eighth-grade pupil, but which was really part of the 
Hudelson Scale, at the time new and unfamiliar. He found that 15 teachers 
who gave passing marks the first time would have failed the pupil the 
second time the paper was marked and that 11 teachers who gave failing 
marks the first time would have passed the pupil the second time. Studies 
involving English composition are especially significant, because every es- 
say examination is a series of compositions, and when English teachers 
who presumably have more than ordinary skill in this field can agree neither 
with other teachers nor with themselves a second time, the situation is very 
serious.⁴¹

In February, 1918, Thorndike published what has proved to be probably 
the most influential paper that has ever appeared on educational measure- 
ments. It began with the well-known dictum: “Whatever exists at all exists 
in some amount,” and ended with this note of satisfaction: “Of the gains 
made in the past decade, we may well be proud.” As he looked into the 
future, Thorndike saw it conditioned by a series of if's:⁴²


If those who object to quantitative thinking in education will set themselves to 
work to understand it; if those who criticise its presuppositions and methods will 
do actual experimental work to improve its general logic and detailed procedure;
if those who are now at work in devising and in using means of measurement will 
continue their work, the next decade will bring sure gains in both theory and 
practice. 


We shall now take a look at what really happened in the years following 
Thorndike’s statement of possible achievements. 

Progress since 1918. According to Buckingham,⁴³ it was in 1919 that
“test-making passed from an amateur to a professional basis.” A good sum-
mary of the next decade has been made by Monroe,⁴⁴ to which reference
has already been made. The monograph begins with the assurance that the 
pioneer state of educational research is passed and that “quantity produc- 
tion” has been achieved. And, as is to be expected, much of the output is 


40 C. E. Hulten, “The Personal Element in Teachers’ Marks,” Journal of Educational
Research, 12: 49-55, June, 1925.

41 Nor is the situation peculiar to America. Studies reported in Europe reveal condi-
tions fully as bad. In England, for example, examiners were found to change their
judgments considerably when they were asked to mark again the same papers they had 
scored a year before. See School and Society, 44: 364, September 19, 1936. 

42 Edward L. Thorndike, “The Nature, Purposes, and General Methods of Measure-
ments of Educational Products,” Seventeenth Yearbook of the National Society for the
Study of Education, Part II, 1918, pages 16-24. Quoted by permission of the Society.

43 B. R. Buckingham, “Our First Twenty-five Years,” Proceedings of the National
Education Association, 1941, page 354. 

44 Walter S. Monroe and others, Ten Years of Educational Research, 1918-27, 368
pages. Bureau of Educational Research Bulletin, No. 42. Urbana: University of Illinois.




not up to the highest quality, when judged by modern standards. Monroe, 
however, detects some evidence of a growing conviction that the emphasis
should be upon quality of work rather than upon mere quantity. 

Moreover, by 1927 there were already developments in new directions 
which represented a distinct advance. The earlier standard tests of achieve- 
ment were largely of the general or survey type, which afforded a general 
all-around measure of the pupil's attainment in the subject, but which did 
not give the detailed information required for remedial work. The next 
decade saw the development of various achievement tests of a specific type. 
For example, there appeared in several fields diagnostic tests, whose func- 
tion was to give specific information regarding the pupil’s strong and weak 
points. Also practice tests were developed, especially in arithmetic, whose 
primary function was not so much measurement as drill. Another important 
development of this period was the organization of tests into batteries made 
up of survey tests in the more important subjects, all published in a single 
booklet. In 1920, two such batteries appeared, one by Pintner and the 
other by Monroe and Buckingham. Two years later appeared the first 
edition of the well known Stanford Achievement Test, which, with suc- 
cessive revisions, has continued to set a high standard. 

There was also a rapid development of high-school tests in the major 
academic subjects. Even today, however, measurement in high school can 
hardly be said to have kept pace with that in the elementary school. There 
has also been some activity, but less marked, in the development of achieve- 
ment tests on the college level. 

There still remained at the end of the first decade of standardized tests 
an important need that had not been met. Confidence in the ordinary 
school examination had been seriously undermined by such studies as those 
to which reference has already been made, and as yet no suitable substitute 
had been found. Also, there were many fields, especially in high school and 
college, where there were hardly any standard tests. Even in the subjects
most fully provided with such tests, they were by no means adequate to 
supply the needs of the classroom teacher. Furthermore, standard tests 
represented a considerable item of expense which school boards at that time 
were often reluctant to assume. The so-called objective, or new-type, test was 
devised to meet just this need. McCall⁴⁵ seems to have been the first to
suggest this type of test, which was merely an adaptation by the classroom 
teacher of the form of the test items used in the standard test. Such tests 
were usually mimeographed, but they were not standardized. Soon they 
were widely and often uncritically used. 

Monroe⁴⁶ has given a brief summary of the measurement movement for


45 William A. McCall, “A New Kind of School Examination,” Journal of Educational
Research, 1: 33-46, January, 1920.

46 Walter S. Monroe, “Educational Measurement in 1920 and in 1945,” Journal of
Educational Research, 38: 334-340, January, 1945.




the quarter of a century beginning in 1920. It was during these eventful
years that educational measurement passed from early adolescence to early 
adulthood. 

Improved examinations, children of necessity. Attention was called 
earlier in the chapter to the role of necessity in the development of intelli- 
gence tests. Much the same influence is evident in the development of 
improved measurement of achievement. The origin of the objective test 
referred to above is a case in point. Three other instances will be cited
briefly. 

It was customary in the early days for the school committees in Mas- 
sachusetts to give oral examinations in the schools under their control. 
By 1845 the enrollments had become so large in Boston that the committee 
could no longer devote the time required for anything more than the most
casual examination of each pupil with an oral quiz. To meet this situation
the uniform written examination was adopted. The results were so grati- 
fying that Horace Mann wrote the enthusiastic defense of written exami- 
nations to which reference has already been made. 

In the latter part of the last century considerable pressure was being 
brought to bear from the outside upon the schools to make place for certain 
new and practical subjects such as manual training and home economics. 
But the school men opposed the move on the ground that there was hardly 
time to teach the subjects already in the curriculum. Then, in 1894, 
Dr. J. M. Rice had what he called an “inspiration.” He says:⁴⁷

In truth, however, I came to recognize that this (the claims of school men follow- 
ing different courses of study) was all talk,—that no one really knew the facts, 
because there were no standards to serve as guides. Then one day, the idea flashed
through my mind that the way to settle the question was to try it out. For a 
beginning I decided to take spelling, and on that very day I made up a list of 
50 words with the view of giving them as a test to the pupils of the schools as I went 
on my tour from town to town. I have no record of the date of the inspiration, but 
I think it was some time in October, 1894. 


This was the origin of the important spelling inquiry which started a move- 
ment that not only transformed the teaching of spelling but brought to the 
fore a new technique for settling educational issues. 

The schools of every period have apparently had to meet the criticism 
that they are not so efficient as those “in the good old days." Usually, again, 
there is no defense except argument based upon mere opinions. The criti- 
cism was especially severe in the early years of the present century. Just at 
this time, in 1906, a fortunate event occurred which taught educators a 
second lesson in the value of comparative examinations. John L. Riley of 
Springfield, Massachusetts, discovered in an old attic a set of examinations 
which had been given in the Springfield schools in the year 1846. The
thought occurred to him to give these same examinations to the pupils in 


47 Quoted by Leonard P. Ayres, op. cit., page 11. Quoted by permission of the Society.




the same city in 1906, just sixty years later.⁴⁸ In spite of the changes in the
content of the subjects, he found the results distinctly favorable to the
later schools. In ninth-grade spelling, for example, the pupils in 1846 had 
averaged 40.6 per cent, while the average was 51.2 per cent in 1906. In like 
manner, the geography average had risen from 40.3 per cent in 1846 to 
53.4 per cent in 1906. But the greatest superiority was in the case of arith- 
metic, where the increase was from an average of 29.4 per cent in 1846 to 
65.2 per cent in 1906. It was evident, therefore, that the facts were the most 
effective tools with which to meet criticism, and that comparative exami-
nations were very useful in supplying these pertinent facts.⁴⁹
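
The size of the gains Riley reported can be checked with simple arithmetic. The following Python sketch uses only the averages quoted above; the variable names are illustrative, and the subtraction by itself says nothing about why the averages rose.

```python
# Average per cent correct in the Springfield examinations, as quoted in the
# text: (1846 average, 1906 average) for each subject.
springfield = {
    "spelling": (40.6, 51.2),
    "geography": (40.3, 53.4),
    "arithmetic": (29.4, 65.2),
}

# Gain in percentage points from 1846 to 1906, rounded to one decimal place.
gains = {
    subject: round(later - earlier, 1)
    for subject, (earlier, later) in springfield.items()
}

# Subject showing the largest gain.
largest = max(gains, key=gains.get)

for subject, gain in gains.items():
    print(f"{subject}: +{gain} percentage points")
print(f"largest gain: {largest}")
```

The gains come to 10.6 percentage points in spelling, 13.1 in geography, and 35.8 in arithmetic, which agrees with the statement that the greatest superiority was in arithmetic.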


D. The History of Character, Personality, and Interest 
Measurement 


Crude beginnings. It is probably true that human beings began to 
pass judgment upon each other and to attempt evaluation of each other's 
character and personality long before the dawn of recorded history. But 
from the standpoint of measurement, these early efforts were both un- 
systematic and untrustworthy. Even when somewhat later these methods 
were reduced to systems, the results were still little better than chance. 
Examples of such prescientific systems which have exerted wide influence
upon a credulous public are astrology, graphology, palmistry, and phren- 
ology. 

In spite of the antiquity of these attempts at evaluating personality and 
character, the scientific study of this field is comparatively new. Nor can 
it be said that the earlier pseudo-scientific approaches have ceased to in-
fluence the popular mind. Galton pioneered here as in so many other aspects
of measurement. More than 60 years ago he came to the conclusion that 
“the character which shapes our conduct is a definite and durable ‘some-
thing,’ and that it is therefore reasonable to attempt to measure it.”⁵⁰
He proposed rating scales with statistical analyses of results, and what he 
termed “rude experiments" suggested many later investigations. Without 
doubt Galton’s ingenious suggestions marked the beginning of the scientific 
measurement of character. 

Further development. In later years analysis and measurement of 
personality and character were greatly stimulated by the interest in 


48 John L. Riley, The Springfield Tests. Springfield, Massachusetts: The Holden Patent
Book Cover Co., 1908. 

49 The discerning student may have noted that the validity of this technique depends
greatly upon the obviously erroneous assumption that educational influences outside 
the school were constant from 1846 to 1906. Attempts to prove the effectiveness of 
“modern” educational practices by this method usually impress the lay public, but 
from the standpoint of strict logic they are doomed to be inconclusive. In Riley’s study, 
however, the marked superiority found for arithmetic as compared with geography may
partially nullify this type of objection.

50 Francis Galton, “Measurement of Character,” Fortnightly Review, 42: 179-185.




educational and vocational guidance, mental hygiene, and character educa-
tion.⁵¹ It was soon recognized that success along these lines was conditioned
upon the ability to measure other things besides general intelligence and 
academic achievement. With respect to character education, for example, 
one of the leaders in that field, Lentz, said: “Character education without
character measurement would appear to be as logical as target practice
in the dark, good shots and poor ones being equally gratifying.”⁵²

The first attempt to measure character by a test was probably that of 
Fernald in 1912, but the author's claims for the test were very modest.⁵³
Voelker, in 1921, devised some actual test situations for measuring char-
acter. By far the most ambitious attempt so far made is that of the Char-
acter Education Inquiry, under the direction of Hartshorne and May,⁵⁴
which extended over the five-year period, 1924-1929. These workers sub- 
jected all the promising tools then in existence to rigid trial, and devised 
many new and ingenious techniques of their own. Their main effort was 
directed at selecting representative and varied life situations which would
afford a valid index of the totality of the character of the individual. 

Most of these methods, however, had had interesting historical ante-
cedents. For example, the celebrated physician, Galen, who lived in the
second century, employed methods which were not unlike those in current
use. On one occasion Galen, employed by the emperor to find out whether
the empress was in love with a certain courtier, attempted to do so by
having the suspected lover appear in the presence of the empress while the
physician felt her pulse to determine the change in heart beat! It will be
noted that Galen’s technique suggests modern physiological methods of
blood pressure, psycho-galvanometer, and the like, as well as the “sam-
pling” method of the performance tests.

It is doubtful whether, as a rule, tests of actual performance have suffi-
ciently demonstrated their superior validity and reliability as measures of
character and personality to justify the additional expense and inconven- 


51 Some idea can be had of the extensive literature on the subject from examining the
annual volumes of the Review of Educational Research. A selected bibliography on char- 
acter, including 282 titles, selected from a complete list of about 1,000 titles, appeared 
in 1932. A supplementary bibliography for the next three years which appeared in 1935 
included over 400 titles. Analogous chapter titles in the 1953 “Educational and Psycho-
logical Testing” issue are “Development and Applications of Nonprojective Tests of
Personality and Interest” and “Development and Applications of Projective Tests of
Personality.” Character measurement per se receives very little attention today. How-
ever, see Vernon Jones, “Character Development in Children—An Objective Approach,”
Chapter 14 in Leonard Carmichael (Editor), Manual of Child Psychology. New York:
John Wiley & Sons, 1946.

52 Theodore F. Lentz, Jr., An Experimental Method for the Discovery and Development
of Tests of Character, page 2. New York: Bureau of Publications, Teachers College,
Columbia University, 1925.

53 Guy G. Fernald, “The Defective Delinquent Class Differentiating Tests,” American
Journal of Insanity, 68: 524-594, 1912.

54 A complete report was published by The Macmillan Company in three volumes.




ience involved in their administration, except for purposes of research. 
In his summary Goodwin Watson says: “It is probable that the last five 
years have brought some swing of psychological interest away from 
personality-test techniques and toward more emphasis upon the study of 
personality through ratings, anecdotal records, observation of behavior, 
and case studies.”⁵⁵ In the final chapter of Symonds’ comprehensive and
analytical book, Diagnosing Personality and Conduct, the author makes 
this statement: “Probably the greatest usefulness will be found in ratings, 
the questionnaire, and the interview for obtaining evidence as to adjust-
ments toward the environment, personal evaluation, attitudes toward re-
ality, sexual relationships, morals, and feelings.”⁵⁶ The same author notes
clear evidence of a lag both in clinical research and in the translation of 
this research into educational practice. He says:⁵⁷


Work which had been done in the decade from 1910 to 1920 on mental and 
achievement tests was being assimilated in educational practice in the 1920’s.
Basic work on the measurement and evaluation of personality that had been carried 
out in the 1920’s was being assimilated in educational practice in the 1930’s. Basic 
work on child guidance procedures, discussions of the meaning of mental hygiene 
in education, and the problems of pupil adjustment which were being investigated 
and elaborated in the 1930’s is only now being assimilated in the practice of educa- 
tion in the schools of the nation. 


The historical development of rating scales, questionnaires, and interviews 
will now receive brief consideration. Strictly speaking, rating scales and 
questionnaires are only devices for recording the judgments of observers, 
rather than true measuring instruments.⁵⁸

Rating scales. The first rating scale in a modern sense was probably
that of Galton for mental imagery, which was published in 1883. About 
the time of the appearance of the first Binet test, Karl Pearson proposed 
a scale for judging intelligence. One of the most famous scales specifically 
for measuring traits of personality is the Scott Man-to-Man Scale, intro- 
duced and extensively used during World War I. It remained for Hart- 
shorne and May to restore somewhat the lost prestige of the rating pro- 
cedure, partly by changing the name to “reputation measures,” but 
mainly by improving the technique.⁵⁹


55 Thirty-Seventh Yearbook of the National Society for the Study of Education, Part II,
page 369. Quoted by permission of the Society. Bloomington, Illinois: Public School
Publishing Company, 1938.

56 Percival M. Symonds, Diagnosing Personality and Conduct, page 567. New York:
D. Appleton Company, Inc., 1931. Used by permission of D. Appleton-Century Com-
pany.

57 Percival M. Symonds, “The Lag in Clinical Research,” Journal of Educational Re-
search, 38: 371-374, January, 1945. Page 373.

58 B. Othanel Smith, Logical Aspects of Educational Measurement, Chapter I. New
York: Columbia University Press, 1938.

59 A historical note by the authors offers another illustration of the kinship between




The questionnaire. The invention of the much-maligned questionnaire
is often ascribed to that versatile Englishman, Sir Francis Galton. But that
the instrument was in existence at an earlier period in England, in fact
if not in name, is evident from the following critical statement:


It is impossible to expect accuracy in returns obtained by circulars, various 
constructions being put upon the same question by different individuals who
consequently classify their replies upon various principles. 


But Galton undoubtedly improved and used extensively the questionnaire,
which was imported to America about 1880 by G. Stanley Hall. The instrument
exists in many forms today, but its principal use in education is
for measuring adjustment, attitude, and interest.

The Woodworth Personal Data Sheet began in 1917 as a method of 
measuring the ability of soldiers to adjust themselves to the trying condi- 
tions of army life. In 1923 Mathews adapted it for school use. The same 
year Cady published another revision which has been widely used with 
children in their teens. Two years later Laird made an adaptation for college
use and provided a graphic rating scale for scoring. In 1919 Pressey
published his widely used X-O Test, a sort of questionnaire covering a
miscellaneous assortment of items having to do with emotionality. The
questionnaire has also been used to measure other types of adjustment,
such as introversion-extroversion by Marston and others, and ascendance-submission
by Allport.

The development of wholesome attitudes has been recognized in recent
years as an important objective of education. Since about 1920 much attention
has been devoted to the measurement of attitudes of various kinds.
Hart’s test of social attitudes and interests, which appeared in 1923, and
Watson's measurement of fairmindedness, which appeared in 1925, are good
examples. Beginning in 1928 Thurstone has been responsible for important
improvements in the units of measurement employed in attitude questionnaires
on many subjects. By having his questions scaled by a group of
judges, Thurstone has found it possible to secure satisfactory results with
fewer items. Figure 3 shows an adaptation of this weighting technique
applied to a scale for measuring a pupil's attitude toward high school.⁶¹




necessity and invention: “For a while it seemed that . . . rating scales as scientific
instruments would be completely discarded. It was necessity that saved the day. While
everyone talked about the superiority of objective tests, yet it was soon found that
many qualities of character yield only stubbornly and expensively to objective testing.
If character and personality studies were to continue, ratings had to be revived. In spite
of all their difficulties, snares, delusions, and pitfalls, they are now staging a considerable
‘comeback.’” The Journal of Social Psychology, 1: 66, February, 1930.

60 Quoted from Journal of the Statistical Society of London, October, 1839, by Walter S.
Monroe and Max D. Engelhart, The Scientific Study of Educational Problems, page 40.
New York: The Macmillan Company, 1936.

61 For a discussion of this scale, see H. H. Remmers, G. C. Brandenburg, and F. H.
Gillespie, “Measuring Attitude Toward the High School,” Journal of Experimental
Education, 2: 60-64, September, 1933.


50 THE PROBLEM OF MEASUREMENT 


In this case the scale values range from 0.6 for Item 1 to 10.1 for Item 22.
The pupil’s score is the median scale value of the items checked.
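
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration and not the published scoring key: the scale values for Items 1 and 22 are the two reported in the text, while the other values, the table name SCALE_VALUES, and the function name attitude_score are assumptions introduced here for the example.

```python
from statistics import median

# Scale values keyed by item number. Only the values for Items 1 and 22
# come from the text; every other value is a made-up placeholder.
SCALE_VALUES = {
    1: 0.6,    # "School is like a prison." (value given in the text)
    7: 1.9,    # hypothetical value
    3: 6.3,    # hypothetical value
    2: 8.4,    # hypothetical value
    22: 10.1,  # "America could not stand as a nation ..." (value given in the text)
}


def attitude_score(checked_items, scale_values=SCALE_VALUES):
    """Return the median scale value of the items the pupil checked."""
    if not checked_items:
        raise ValueError("no items were checked, so no score can be computed")
    return median(scale_values[item] for item in checked_items)


# A pupil who checks Items 2, 3, and 22 has the values 8.4, 6.3, and 10.1.
print(attitude_score([2, 3, 22]))  # prints 8.4
```

Because the score is a median rather than a sum, a pupil who checks many statements is not scored higher than one who checks few; only the typical scale position of the endorsed statements counts. When an even number of items is checked, the median falls halfway between the two middle scale values.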

The close relationship of interest to guidance, whether educational, voca- 
tional, or personal, has stimulated considerable activity directed toward 
its measurement. In 1907 G. Stanley Hall published a questionnaire study 
of the recreational interests of children, which exerted a wide influence.


HIGH SCHOOL ATTITUDE SCALE* 
Form B 


Below is a list of twenty-five statements about school. Place a check mark 
before each statement with which you agree, and leave unmarked those with 
which you disagree. This test will in no way affect your standing in school. 


------ 1. School is like a prison.
------ 2. I have a lot of fun in school.
------ 3. School is all right.
------ 4. My teachers always treat me fairly.
------ 5. I like to go to school to be with other people.
------ 6. Many of our great men have no high school education.
------ 7. I hate most school work.
------ 8. Some things about high school are all right.
------ 9. High school training develops personality.
------10. High school is a good thing for some people and a bad thing
          for others.
------11. I would just as soon stay at home as go to school.
------12. All the better class of people have high school educations.
------13. The high schools lift the plane of sportsmanship in a community.
------14. Too much money is being spent on high schools for the benefit
          received.
------15. The high school teaches mostly old useless information.
------16. They won’t teach things one really wants to know in high
          school.
------17. If one has plenty of money it may be all right to go to high
          school.
------18. I haven't any definite like or dislike for high school.
------19. Any old fogy knows more than a high school graduate.
------20. The kindest and best people I know don’t have a high school
          education.
------21. High school cramps and dwarfs one’s personality.
------22. America could not stand as a nation if it were not for our
          high schools.
------23. Our high schools teach immorality and indecency.
------24. High school training develops high ideals in pupils.
------25. High schools develop loyalty.


* Prepared by F. H. Gillespie and published by Purdue University. 


Figure 3. A Scale for Measuring Pupils’ Attitudes Toward High School.




This pioneer study has been followed by many others, of which the most 
extensive is probably that of Lehman and Witty, published in 1927. Moore 
appears to have been the first to use this technique for the measurement 
of vocational interests in 1921. Two other studies, employing somewhat 
different techniques, appeared in 1926. One of these, by Miner, offered 
paired comparisons, and the other, by Cowdry, employed a complicated 
scheme for weighting the scores. The best known of all are the Strong 
Vocational Interest Blank, which appeared in 1927, and the Kuder Prefer- 
ence Record, which followed in 1939. 

Paper-and-pencil personality inventories have been severely criticized
in recent years, especially by a prominent clinical psychologist, Albert
Ellis.⁶² He speaks out against the more elaborately scored and tediously
interpreted instruments of this kind:⁶³

. . . since personality inventories depend, in the last analysis, on printed ques- 
tions, and since virtually no one claims for them the advantage of getting at un- 
conscious and semi-conscious material which effective clinical use of interview and 
projective methods are generally conceded to some extent to uncover, it is difficult 
to see why clinicians should spend considerable time first mastering and then using 
these inventories. 


The interview. One of the oldest forms of obtaining knowledge is the 
personal interview. It has always been an important tool in the hands of 
certain professional men, such as lawyers, doctors, and newspaper re- 
porters; but it is used by everybody to some extent. Its chief value in 
education is probably in diagnosis and guidance.* The interview is used 
to supplement the ordinary objective evidence about a pupil, such as is 
afforded by his school record and by a firsthand knowledge of such things 
as his feelings and point of view. The evidence indicates that the personal 
qualities of the interviewer are fully as important as the technique em- 
ployed. 

Recent developments. Three promising developments relating to meas- 
urement require brief mention. The first of these is the attempt to subject 
personality to elaborate statistical analysis. Leaders in this movement have 


62 In “The Validity of Personality Questionnaires,” Psychological Bulletin, 43: 385-440,
September, 1946. Albert Ellis and Herbert S. Conrad are somewhat more favorable
to them in “The Validity of Personality Inventories in Military Practice,” Psychological
Bulletin, 45: 385-426, September, 1948.

63 Albert Ellis, “Recent Research with Personality Questionnaires,” Journal of Consulting
Psychology, 17: 45-49, February, 1953. Page 48. Quoted with permission of the
Journal of Consulting Psychology and the American Psychological Association. For a
rebuttal see Allen Calvin and James McConnell, “Ellis on Personality Inventories,”
Journal of Consulting Psychology, 17: 462-464, December, 1953.

64 Thorndike has described the Stanford-Binet as “an approved, systematized and
standardized interview.” The Measurement of Intelligence, page 1. New York: Bureau
of Publications, Teachers College, Columbia University, 1927.

65 Valuable suggestions on the interview are found in: Marie Jahoda, Morton Deutsch,
and Stuart W. Cook, Research Methods in Social Relations, Parts I and II, Chapters
... XII, and XIII. New York: The Dryden Press, 1951.




been Spearman, Kelley, Thurstone, Flanagan, R. B. Cattell, Eysenck, and 
Stephenson. The second development is that of controlled observation or 
time-sampling, which is employed mainly in the child-study laboratories.
A third development has been in the direction of measuring public opinion.⁶⁶


E. Some Important Publications 


From the beginning, professional journals, books, and other publications 
have exerted a profound influence upon experimental psychology and the 
testing movement. Only the most important can be mentioned here.

Professional journals. The value of professional journals is that they 
keep the professional worker in one area continuously informed of what 
is going on in his own area and elsewhere. The first psychological journal 
was Mind, founded in England by Bain in 1876. For eleven years it remained
the only psychological journal in the English language, and so was
the vehicle for most of the important psychological articles both in England 
and in America. One of the most important articles on measurement during 
the early years was probably that by Cattell entitled “Mental Tests and 
Measurements," which appeared in 1890 and contained some significant 
comments by Galton. The first psychological journal in America was the 
American Journal of Psychology, started by G. Stanley Hall in 1887. It has 
published many significant articles on measurement and statistics, but 
doubtless none more important than Spearman’s “General Intelligence 
Objectively Determined and Measured,” which appeared in 1904. This was 
the original formulation of the now well-known two-factor theory of intel- 
ligence, and the beginning of “correlational psychology," which was in- 
fluential in directing the attention of psychologists from faculties to factors. 

Hall also founded Pedagogical Seminary (now the Journal of Genetic 
Psychology), which may be regarded as the first journal of educational 
psychology, in 1891. Three years later Binet started L’ Année Psycholo- 
gique, which was to be the principal agency for bringing his extensive work 
on the measurement of intelligence to the attention of the world. The 
Teachers College Record, started in 1900, has published many of Thorndike's 
important studies, and those of his students and co-workers. The first 
volume of the Journal of Educational Psychology, founded in 1910, con- 


66 Hadley Cantril (Editor) and Mildred Strunk (Compiler), Public Opinion 1935-1946.
Princeton, N. J.: Princeton University Press, 1951. 1191 pages.

George H. Gallup, A Guide to Public Opinion Polls. Princeton, N. J.: Princeton University
Press, 1944. 104 pages.

Mildred B. Parten, Surveys, Polls, and Samples. New York: Harper, 1950. 624 pages.

For a critical summary and extensive bibliography see: Quinn McNemar, “Opinion-Attitude
Methodology,” Psychological Bulletin, 43: 289-374, July, 1946. This article
aroused considerable controversy in the Psychological Bulletin: Leo P. Crespi, “‘Opinion-Attitude
Methodology’ and the Polls—a Rejoinder,” 43: 562-569, November, 1946;
Herbert S. Conrad, “Some Principles of Attitude-Measurement: a Reply to ‘Opinion-Attitude
Methodology,’” 43: 570-589, November, 1946; and Quinn McNemar, “Response
to Crespi’s Rejoinder and Conrad’s Reply to Appraisal of Opinion-Attitude
Methodology,” 44: 171-176, March, 1947.




tained an article by Huey on “The Binet Scale for Measuring Intelligence
and Retardation.” Two other journals, School and Society, and Educational 
Administration and Supervision, both founded in 1915, have included many 
reports on the use of tests both for research purposes and for the actual 
work of instruction and school administration. But probably no journal has 
been more important than the Journal of Educational Research, started 
in 1920. From the very first issue, which contained McCall's article on
“A New Kind of School Examination,” to the present time, it has exerted 
a wide influence upon the measurement movement. 

Psychometrika, which began publication in 1936 as “a journal devoted
to the development of psychology as a quantitative rational science," car- 
ries many articles applying mathematical procedures to measurement prob- 
lems. On a less mathematical level is Educational and Psychological Meas- 
urement, begun in 1941. It is particularly valuable for the reasonably well 
trained guidance worker. Even less technical is the Personnel and Guidance 
Journal, formerly called Occupations. Two other publications whose ar- 
ticles frequently concern tests are the Journal of Applied Psychology and 
the Journal of Consulting Psychology. 

Measurement is such an important topic that practically all professional 
journals, as well as many “popular” magazines, deal with various aspects 
rather often. 

Some important books. Some of the important books are briefly mentioned
in chronological order, beginning with the pioneer period, which
included roughly the first two decades of the century. In 1904 appeared
E. L. Thorndike’s An Introduction to the Theory of Mental and Social
Measurements,⁶⁷ which made available for the first time to American students
the statistical techniques necessary for educational research and
measurement. Ten years later Truman Kelley's Educational Guidance⁶⁸ introduced
educational workers to the alluring possibilities of partial and
multiple correlations. Two important books appeared in 1916. One of these,
Daniel Starch’s Educational Measurements,⁶⁹ was the first book on achievement
tests, and the other, L. M. Terman's The Measurement of Intelligence,⁷⁰
was the first adequate treatment of intelligence tests in the English
language. In 1918 appeared the Seventeenth Yearbook of the National Society
for the Study of Education,⁷¹ Part II of which treats in some detail the history
of the pioneer period in testing, and which gives descriptions of existing
tests with suggestions as to their use. But it is probably most famous for
containing Thorndike’s statement, “Whatever exists at all exists in some
amount,” which has been accepted as a sort of creed by many workers




67 Published by Bureau of Publications, Teachers College, Columbia University.

68 Published by Bureau of Publications, Teachers College, Columbia University.

69 Published by The Macmillan Company, New York.

70 Published by Houghton Mifflin Company, Boston.

71 Published by Public School Publishing Company, Bloomington, Illinois.




in the field. The following year, in 1919, appeared Carl Seashore's The
Psychology of Musical Talent,⁷² a pioneer study in the measurement of aptitude
in a restricted field.

Since the beginning of the third decade of the century, the “quantity
production” stage referred to by Monroe has been achieved not only in the
publication of tests but in books as well. Only a few of these can be mentioned
here as representative of types. In 1922 appeared W. A. McCall’s
How to Measure in Education,⁷³ a comprehensive and critical book on
achievement tests. The next year saw the publication of Ben D. Wood's
Measurement in Higher Education,⁷⁴ the first treatise on measurement at
the college level. In 1924 appeared G. M. Ruch’s The Improvement of the
Written Examination,⁷⁵ which was the first book wholly devoted to the new-type
test. The year 1927 was especially productive, for at least five important
books on measurement bore that date of publication. There were two
notable books on intelligence, E. L. Thorndike’s The Measurement of Intelligence⁷⁶
and C. E. Spearman’s The Abilities of Man,⁷⁷ each representing
a distinct point of view. In 1927 also appeared the first two books specifically
devoted to measurement in the high school, P. M. Symonds’ Measurement
in Secondary Education,⁷⁸ and G. M. Ruch's and G. D. Stoddard's
Tests and Measurements in High School Instruction.⁷⁹ The same year Truman
Kelley’s critical discussion entitled Interpretation of Educational Measurements⁸⁰
was published. The next year, in 1928, appeared another critical
volume, Clark Hull's Aptitude Testing,⁸¹ destined to become one of the
classics in the field of measurement.

During the 1930's there appeared numerous books and monographs on
the various phases of measurement and their application to the different
educational levels. Some evidence that measurement was coming of age is
afforded by the fact that extensive bibliographies of tests and scales appeared
during this decade. The first edition of Gertrude Hildreth’s Bibliography
of Mental Tests and Rating Scales⁸² was published in 1933. Three
years later came Oscar Buros’ Educational, Psychological, and Personality
Tests of 1933, 1934, and 1935,⁸³ the forerunner of The Mental Measurements
Yearbooks, the first volume of which appeared in 1938.⁸⁴ The publication

72 Published by Silver, Burdett and Company, New York.

73 Published by The Macmillan Company, New York.

74 Published by World Book Company, Yonkers.

75 Published by Scott, Foresman & Company, Chicago.

76 Published by Bureau of Publications, Teachers College, Columbia University.

77 Published by The Macmillan Company, New York.

78 Published by The Macmillan Company, New York.

79 Published by World Book Company, Yonkers, N. Y.

80 Published by World Book Company, Yonkers.

81 Published by World Book Company, Yonkers.

82 Published by The Psychological Corporation, New York.

83 Published by Rutgers University Press, New Brunswick, New Jersey.

84 Published by Rutgers University Press.




of this critical volume marked an important milestone in the history of
educational measurement. The Nineteen Forty Mental Measurements Yearbook,⁸⁵
The Third Mental Measurements Yearbook (1949),⁸⁶ and The Fourth
Mental Measurements Yearbook (1953)⁸⁷ are Buros' invaluable sequels to
these earlier works.

David Wechsler's The Measurement of Adult Intelligence,⁸⁸ a comprehensive
manual accompanying the first individual intelligence test designed
especially for testing adolescents and adults, was first published in 1939.
The Wechsler-Bellevue Intelligence Scales have been well accepted, particularly
by clinicians. The Wechsler Intelligence Scale for Children⁸⁹ emerged
in 1949 as a serious competitor of the Revised Stanford-Binet Intelligence
Scale in the age range 5-15 years.

Though World War II temporarily interrupted the publishing efforts of
most measurement specialists, during the post-war years a veritable deluge
of excellent books hit the market in rapid-fire succession.⁹⁰ Among these
were Frederick B. Davis’ Utilizing Human Talent⁹¹ and Dorothy C. Adkins
and others’ Construction and Analysis of Achievement Tests⁹² (1947); Lee J.
Cronbach’s Essentials of Psychological Testing,⁹³ Florence L. Goodenough’s
Mental Testing,⁹⁴ William Stephenson's Testing School Children,⁹⁵ Donald E.
Super's Appraising Vocational Fitness,⁹⁶ and Robert L. Thorndike’s Personnel
Selection, Test and Measurement Techniques⁹⁷ (1949); Frank S. Freeman's
Theory and Practice of Psychological Testing⁹⁸ and Harold Gulliksen’s
Theory of Mental Tests⁹⁹ (1950); and E. F. Lindquist and others’
Educational Measurement¹⁰⁰ (1951).

The Goheen-Kavruck Selected References on Test Construction, Mental
Test Theory, and Statistics, 1929-1949¹⁰¹ greatly facilitate research, putting
at one’s finger tips the titles and sources of much of the important material
which appeared during those two decades.


85 Published by the Gryphon Press, Highland Park, N. J.

86 Published by Rutgers University Press.

87 Published by Gryphon Press.

88 Published by the Williams & Wilkins Company, Baltimore, Maryland.

89 Published by The Psychological Corporation, New York.

90 Julian C. Stanley, “Five Recent Educational and Psychological Measurement Textbooks,”
Harvard Educational Review, 22: 57-61, Winter, 1952.

91 Published by the American Council on Education, Washington, D. C.

92 Published by the U. S. Government Printing Office, Washington 25, D. C.

93 Published by Harper & Brothers, New York.

94 Published by Rinehart & Company, New York.

95 Published by Longmans, Green and Company, New York.

96 Published by Harper & Brothers, New York.

97 Published by John Wiley & Sons, New York.

98 Published by Henry Holt and Company, New York.

99 Published by John Wiley & Sons, New York.

100 Published by the American Council on Education, Washington, D. C.

101 Published by the U. S. Government Printing Office, Washington 25, D. C., 1950.




F. Some Relatively Recent Tendencies 


Test construction. The “quantity production” stage of test construction
in America now seems definitely past. The emphasis has turned to
quality, although it is still too much to say that a recent copyright date
on a test is ample assurance of high merit. Kelley's observation made in
1927 that “the ruts of the test movement are already so deep that there are
many who do not see beyond them”¹⁰² is still, unfortunately, true. However,
test makers as a group no longer unblushingly make the enthusiastic
claims for their products that were common a few years ago. Instead there
has grown up a more critical and becomingly modest attitude, which is
probably the most characteristic feature of the present trend. One
observer¹⁰³ as early as 1920 noted “evidences of the beginnings of a critical
attitude toward educational tests.”

Another tendency, largely an outgrowth of this critical attitude, is to
extend the field of measurement into new areas and to develop new, and
usually more specific, types of tests. For example, instead of a primary
interest in developing tests of general intelligence, the emphasis is upon
developing tests of specific intelligence along particular lines. Reading readiness
and other aptitude tests are representative of the trend. Even so-called
general intelligence tests that have appeared recently often yield
several scores which have possible diagnostic value. Increased attention to
the reliability, and more particularly to the validity, of tests and individual
test items is also a notable trend. The result has been the appearance of new
types of test forms and test situations. Along with this has come the realization
that standard tests do not fully meet all the needs of measurement, and
that in consequence greater emphasis must be placed upon the development
of improved techniques for constructing informal teacher-made tests and
other techniques of evaluation.


Monroe makes the following excellent summary of the situation:¹⁰⁴


The most significant trends appear to be (1) the attack upon and the consequent
discrediting of essay examinations, (2) the development of objective tests, and
the emphasis upon reliability as the criterion by which measuring instruments were
evaluated, and (3) the development of diagnostic and prognostic uses. What of
the future? Any attempt to project lines of development into the future is attended
with uncertainty. But, if I interpret correctly current educational writings in the
field, three trends are indicated: (1) a growing emphasis upon validity and a
consequent decreasing emphasis upon reliability as the criterion for evaluating


102 Truman Lee Kelley, Interpretation of Educational Measurements, page 16. Yonkers:
World Book Company, 1927.

103 Walter S. Monroe, “Educational Measurement in 1920 and 1945,” Journal of
Educational Research, 38: 334-340, January, 1945.

104 Walter S. Monroe, “Some Trends in Educational Measurement,” Twenty-Fourth
Annual Conference on Educational Measurements, page 35. Bulletin of the School of
Education, Indiana University, Vol. XIII, No. 4. Bloomington, Indiana: Bureau of
Cooperative Research, 1937.




measuring instruments; (2) a decline of the faith in indirect measurement, and an
increasing emphasis upon direct measurement as a means of attaining satisfactory
validity; and (3) a growing respect for essay examinations as instruments for 
measuring certain outcomes of instruction. 


Use of tests. A British psychologist has suggested that a new scientific
technique seems to go through three stages, as follows:¹⁰⁵


The first is the early state of development when no one, except its inventors, 
is interested in it, and those working on other lines regard it with indifference or 
suspicion or else think it silly. In the second stage it begins to gain support, and 
in the third stage everyone wants to use it whether they understand it or not. 
There is then danger of a fourth stage of disillusionment, and this is the time for 
critical examination. 


In the case of standard tests in America, the stage of indifference and 
suspicion, with which Rice's spelling inquiry was met, had largely passed 
when the first standardized tests appeared during the first decade of the 
present century. Since that time there have been three rather clearly de- 
fined stages, which may be designated as those of curiosity, confidence, and 
critical caution. 

The first stage was that of curiosity. In this stage teachers and school 
officials tried out tests merely because they were something new and be- 
cause their use gave evidence, if indeed superficial in character, of up-to- 
date-ness. This attitude tended to die a natural death as the novelty wore 
off. 

The second stage was that of confidence, or in some instances that of 
overconfidence. Standard tests were “swallowed, hook, line, and sinker.” 
Test results were uncritically accepted at their face value. IQ's were naively 
taken as accurate measures of innate capacity wholly apart from environmental
opportunities, and so were as fixed as the laws of the Medes and the
Persians.¹⁰⁶ In like manner, achievement test scores were accepted as fully
adequate measures of the important outcomes of instruction. If only such 
tests were objective, they were assumed to be sufficiently accurate for valid 
comparisons, not only of one school or class with another, but also of one 
pupil with another, or even of one aspect of a pupil's achievement with
another aspect of his achievement.¹⁰⁷ There is some evidence that this atti-

105 J. O. Irwin, “Correlation Methods in Psychology,” British Journal of Psychology,
25: 86-91, July, 1934.

106 What happened to intelligence testing following World War I has been described
as follows: “As many of the subjects tested were children of school age, because Binet’s
scores gave a good correlation with ability for school work, and perhaps because of the
relative simplicity and economy of the methods, mental testing was oversold, and careful
psychological work in the field of individual differences still suffers from this effect.”
Francis N. Maxfield, “Trends in Testing Intelligence,” Educational Research Bulletin,
15: 137, May 13, 1936.

107 What happened to achievement testing has been described as follows: “In the
widespread use of objective tests at the high school and college levels, there is apparent
a child-like faith in the efficacy of objective tests as instruments for measuring school
achievement. . . . A little knowledge has become a dangerous thing.” Walter S.




tude is on the decline, although unfortunately it is still found too often in
certain quarters. 

The third stage may be termed that of critical caution. While by no means 
universal, this more wholesome attitude, on the whole, characterizes the 
later phases of the testing movement. Hildreth points out some beneficial 
results of this change: “А more critical attitude toward intelligence meas- 
urement, as the outcome of continued experimentation, has resulted in 
more authoritative research findings, more sensible and intelligent interpretation
of data.”¹⁰⁸ This attitude has shown itself with respect to achievement
tests and personality measurements as well. The result has been not
so much the curtailment of the use of tests, as their more critical use and 
the more cautious interpretation of test scores. On the whole, the current 
emphasis is on the use of standard tests, not so much for comparative pur- 
poses, as to provide a basis for guidance and remedial instruction. It is 
increasingly recognized that tests are means and not ends, and that even 
the best test is but a tool, the value of which depends upon the skill and the 
intelligence with which it is used. 

This enlarged and more critical attitude on the part of enlightened 
school officials has been well stated by Maxfield:¹⁰⁹


In problems of school administration the massed data from intelligence tests 
will be interpreted by statistical methods. In dealing with the problems of individual
pupils the case-study method of the clinical psychologist will prevail.
Inventories of personality, scales of social adjustment, and the like, will supplement
tests of intelligence. Diagnostic tests will be supplemented by diagnostic
teaching. But no synthesis or interpretation will be attempted without knowledge 
of the pupil’s physical condition, his home background, his previous school history, 
his vocational interests, his social and emotional reactions, and the like. The 
weight given in this synthesis to scores on intelligence tests will vary with the 
problem presented. The case-study method can be adapted to any philosophy of 
education and to any educational aims and objectives. 


Lindquist concludes his excellent forty-page chapter, “Preliminary Considerations
in Objective Test Construction,” with a strong warning:¹¹⁰


If measurement is to continue to play an increasingly important role in educa- 
tion, measurement workers must be much more than technicians. Unless their 
efforts are directed by a sound educational philosophy, unless they accept and 
welcome a greater share of responsibility for the selection and clarification of 
educational objectives, unless they show much more concern with what they
measure as well as with how they measure it, much of their work will prove futile
or ineffective.


Monroe, “Hazards in the Measurement of Achievement,” School and Society, 41: 48-49, 
January 12, 1935. 


108 Gertrude Hildreth, “Applications of Intelligence Testing,” Review of Educational
Research, 5: 200, June, 1935.

109 Francis N. Maxfield, op. cit., page 140.

110 In Educational Measurement, page 158. Washington, D. C.: American Council on
Education, 1951.




SELECTED REFERENCES FOR FURTHER READING

Boynton, Paul L., Intelligence: Its Manifestations and Measurement. New York:
D. Appleton-Century Company, Inc., 1933. Chapters V and VI.

Cook, Walter W., “Achievement Tests,” in Walter S. Monroe (Editor), Encyclopedia
of Educational Research, pages 1461-1478. New York: The Macmillan
Company, 1950.

Cook, Walter W., “What Educational Measurement in the Education of Teachers?”
Journal of Educational Psychology, 41: 339-347, October, 1950.

Gardner, Eric F., “Development and Applications of Tests of Educational Achievement
in Schools and Colleges,” Review of Educational Research, 23: 85-101,
February, 1953.

Freeman, Frank N., Mental Tests. Their History, Principles and Applications 
(Revised Edition). Boston: Houghton Mifflin Company, 1939. Chapters I-VIII. 

Moody, Caesar B., “Historical Outline of Concepts of Mental Ability,” Peabody 
Journal of Education, 30: 194-204, January, 1953. 

Odell, C. W., Educational Measurement in High School. New York: D. Appleton-
Century Company, 1930. Chapter II.

Peterson, Joseph, Early Conceptions and Tests of Intelligence. Yonkers, N. Y.: World 
Book Company, 1925. 320 pages. 

Pintner, Rudolph, Intelligence Testing, Methods and Results (New Edition). New 
York: Henry Holt & Company, 1931. Part I. 

Rice, Joseph M., Scientific Management in Education. New York: Hinds, Noble & 
Eldredge, 1913. 282 pages. 

Ruch, G. M., The Objective or New-Type Examination. Chicago: Scott, Foresman 
& Company, 1929. Chapters I and III. 

Rugg, Harold O., Foundations for American Education. Yonkers, N. Y.: World
Book Company, 1947. Chapter XXIII, “Fifty Years of Scientific Method in
Education: What Have We Learned?”

Thurstone, Louis L., and Chave, Ernest J., The Measurement of Attitude. Chicago: 
University of Chicago Press, 1929. 96 pages. 

U. S. Office of Strategic Services, Assessment of Men. New York: Rinehart & 
Company, 1948. 541 pages. 

Young, Kimball, “The History of Mental Testing,” Pedagogical Seminary, 31: 1-48, 
March, 1924. 


3


The Statistical Analysis of Test Results


A. General Considerations


The importance of statistics. Measurement and evaluation are coming
increasingly to depend upon statistical procedures. Almost all test
manuals discuss central tendency, variability, percentiles, standard scores,
reliability, and validity, usually presuming that the reader understands
certain commonly used statistics fairly well. Educational literature of all
types contains such concepts and statistics; they are also mentioned at
professional meetings. Workers in all fields of education can anticipate
heightened emphasis upon statistical thinking and techniques.

The view taken by Helen Walker is even broader:¹


The conclusion seems inescapable that some aspects of statistical thinking which
were once assumed to belong in rather specialized technical courses must now be
considered part of general cultural education.


cates and W. Edwards Deming,
cipation in Current Affairs,” School Review, 56: 262-269, May, 1948; and
Millicent Haines, “Thinking Straight about Facts and Figures,”
03, November, 1948,


Statistics in a capsule. Nearly every elementary measurement text,
most general psychology books, and many other introductory volumes contain
a chapter on statistics which is hopefully designed to achieve the
impossible—that is, to teach in a week or so material usually covered in a 
quarter, semester, or year-long statistics course. Though some of these 
chapters are very good indeed, failure to attain their unrealistic goal is 
frustrating to students and teacher alike. It is virtually impossible to teach 
very much statistics in a short period of time, nor is it practicable to devote
a large portion of the measurement course to this area. 

The writers have tried to solve this dilemma by omitting the more ad- 
vanced material, by putting emphasis on concepts rather than computa- 
tions, by repeating main ideas frequently, and by saving certain techniques 
for the Appendix. Fifty multiple-choice items are presented in Appendix A, 
pages 429-435, to help the student test his grasp of basic principles. A sum- 
mary of common statistical terms and a selected list of statistics textbooks 
appear at the end of the chapter. 

Despite these aids, the reader will find that he cannot read the chapter 
like a light novel. It will require careful study, much as one would prepare 
chemistry or physics assignments. However, this effort should result in 
permanently improved ability to understand some forms of statistical com- 
munication, though real mastery cannot be expected. Very likely your 
teacher will want to be quite selective in his emphasis upon the various 
topics in this chapter. He may skip some of them entirely. 


B. Classification and Tabulation 


Before test scores or other quantitative data can be comprehended and 
interpreted, it is usually necessary to summarize them. Table 3 gives a 
class record for a reading readiness test administered at the beginning of 
the school year. The scores appear in alphabetical order as they are recorded 
in the teacher’s class roll book. However, the scores do not mean very much 
in this form. It is with some difficulty that we can tell whether Richard A, 
with a score of 90, for example, is a very superior or just an average pupil. 

Rank Order. Ordinarily the first step is to arrange the scores in order 
of size, usually from high to low. This is called an ungrouped series. In a 
small class, this is sometimes all that is necessary. Table 4 gives the same 
scores as Table 3 arranged in order of size. This table also shows the rank 
order of the pupils and the scores tabulated without further grouping. 
It is now easy to see that Richard A’s score of 90 gives him a rank of thir- 
teen in a class of thirty-eight, or about one third of the way from the top. 
In a similar manner, it is easy to interpret each of the other scores in terms
of rank. But this method, especially in classes of twenty or more pupils,
is likely to prove unsatisfactory. Note, for example, that two pupils make
a score of 97. Since it is not correct to say that one ranks higher than the
other, it is necessary to assign them fractional ranks. As there are six pupils
who rank higher, the next two ranks, 7 and 8, are averaged, which gives 7.5.
In like manner the average of ranks 9 and 10 is 9.5, and so on for the other
pupils with tied scores. Since there are three pupils each of whom makes a
score of 75, and there are twenty-one pupils who rank above this score, the




average of the next three ranks, 22, 23, and 24 is 23, which is the rank 
assigned each of the scores of 75. In addition to the fact that time and 
trouble are required to determine these ranks, the list is long and unwieldy 
to handle, and is inadequate for making comparisons with other classes 
which are much larger or much smaller. 
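
The averaging of tied ranks described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the original procedure: the names SCORES and average_ranks are chosen here for convenience, and the scores are the 38 values listed under "Original Scores" in Table 5.

```python
from collections import Counter

# The 38 reading readiness scores, in the order given in Table 5.
SCORES = [
    90, 66, 106, 84, 105, 83, 104, 82, 97, 97, 59, 95, 78, 70, 47, 95, 100, 69, 44,
    80, 75, 75, 51, 109, 89, 58, 59, 72, 74, 75, 81, 71, 68, 112, 62, 91, 93, 84,
]


def average_ranks(scores):
    """Map each distinct score to its rank (1 = highest); ties share the mean rank."""
    counts = Counter(scores)
    ranks = {}
    higher = 0  # number of pupils who rank above the current score
    for score in sorted(counts, reverse=True):
        tied = counts[score]
        # The tied pupils occupy ranks higher+1 ... higher+tied; average them.
        ranks[score] = higher + (tied + 1) / 2
        higher += tied
    return ranks


ranks = average_ranks(SCORES)
for score in (90, 97, 95, 75):
    print(score, ranks[score])
```

Under these assumptions the sketch gives Richard A's score of 90 a rank of 13, the two scores of 97 a rank of 7.5, and the three scores of 75 a rank of 23, in agreement with the text.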

The frequency table or distribution. One way out of the difficulty 
is to arrange the scores in a special way. Such a process is called tabulation. 
The table itself is called a frequency distribution, or merely distribution. 


TABLE 3

A CLASS RECORD FOR A READING READINESS TEST
(38 Pupils)

Pupil                 Score

Richard A.              90
(name illegible)        66
(name illegible)       106
(name illegible)        84
Mildred C.             105
Robert C.               83
Robbin C.              104
Diney D.                82
Jim D.                  97
John D.                 97
(name illegible)        59
Don F.                  95
Larry F.                78
Richard G.              70
Warren H.               47
Sylva H.                95
Robert H.              100
Grover H.               69
Jack K.                 44
Clarence K.             80
Jerome L.               75
Mary M.                 75
Billy N.                51
Nancy O.               109
Carrie P.               89
Ralph R.                58
Billy S.                59
William S.              72
Gretta S.               74
George S.               75
Robert S.               81
Jack S.                 71
Richard S.              68
Mary S.                112
Jean T.                 62
Richard W.              91
Dolores W.




The third and fourth columns of Table 4 show the simplest form of a dis- 
tribution. Such a distribution consists of two columns; the various scores 
are arranged in one column in order of size, and opposite each score is 
recorded in the other column the number of times it occurs. Each entry 
in the second column is called a frequency, abbreviated f, and the total is 
represented by N. 

It is usually desirable, however, to carry the process one step further.


TABLE 4

READING READINESS SCORES FROM TABLE 3 ARRANGED IN ORDER OF SIZE AND
RANK ORDER, AND TABULATED

                                  Tabulated Without Further Grouping
Order of Size    Rank Order       Score      Frequency (f)

     112             1             112            1
     109             2             109            1
     106             3             106            1
     105             4             105            1
     104             5             104            1
     100             6             100            1
      97             7.5            97            2
      97             7.5            95            2
      95             9.5            93            1
      95             9.5            91            1
      93            11              90            1
      91            12              89            1
      90            13              84            2
      89            14              83            1
      84            15.5            82            1
      84            15.5            81            1
      83            17              80            1
      82            18              78            1
      81            19              75            3
      80            20              74            1
      78            21              72            1
      75            23              71            1
      75            23              70            1
      75            23              69            1
      74            25              68            1
      72            26              66            1
      71            27              62            1
      70            28              59            2
      69            29              58            1
      68            30              51            1
      66            31              47            1
      62            32              44            1
      59            33.5
      59            33.5                      N = 38
      58            35
      51            36
      47            37
      44            38




As a rule, there is so wide a range of scores that it is economical to group 
them according to size, such as a group including all scores from 110 to 114, 
inclusive, from 105 through 109, inclusive, and so on. Each group is called 
a class. The complete grouping arrangement is usually referred to as a 
grouped frequency distribution. While there is no absolutely fixed rule for 
the number of classes, it is usually advisable to make not fewer than twelve 
classes nor more than about fifteen. To have fewer than twelve classes is to 
run the risk of distorting the results, while more than fifteen classes pro- 
duces a table that is inconvenient to handle. 

Making the frequency table. There are four steps in making the ordi- 
nary grouped frequency distribution. These are illustrated in Table 5, 
using the scores given in Table 4. 

1. Determine the range, which is one more than the difference between 
the highest score and the lowest. Of these scores, the highest is 112 and the 
lowest is 44, which gives a range of (112 — 44) + 1 = 69. 

2. Select the class interval, which is the size of the groups into which the
scores are to be classified. To do this, divide the range by 12, which gives
the largest group, or class interval, to be used; and by 15, which gives the
smallest class interval to be used. In this case, 69 ÷ 12 = 5.75, and
69 ÷ 15 = 4.6. Since it is impractical to use any class interval except a
whole number, each quotient is rounded up to the next whole number.
The class interval will, therefore, be either 6 or 5. Of these
class intervals it is best to choose the one which is more convenient to use.
Odd-numbered intervals have whole-number midpoints when the class
limits are fractional (end in .5), so usually they are to be preferred over
even-numbered intervals, which have fractional midpoints. The midpoint
of an odd-numbered group like 110-114, wherein there are 5 scores, is 112,
but if a class size of 6 were used, the score limits being, for example,
108-113, the midpoint of this even-numbered group would be 110.5, which
might result in more complex computations. Therefore, a class interval
of 5 is somewhat preferable to 6 when the class limits are fractional.

3. Determine the limits of the classes. The table must, of course, be long
enough to include the highest score and the lowest score. To facilitate
tabulation start each class with a multiple of the class interval. If the lowest
class starts with 40, which is a multiple of 5, it will accommodate the lowest
score, 44. Each succeeding whole-number class limit will be 5 points above
the one just below it. The next class will start at 45, the next at 50, and
so on, until the highest score, 112, is included in the class 110-114.

4. Make the tabulation. A short vertical line (tally) is drawn for each
score opposite the class in which it falls. To make a tabulation it is not
necessary to have all the scores arranged in order, for this process may
require more time than the tabulation itself. In the original alphabetical list
the first score is 90. In the “tabulation” column opposite the class which
begins with 90, a vertical line is drawn to indicate the score. The next
score is 66. This falls in the class which begins at 65, so a line is made there.
In the same way, a line is placed in the column opposite the appropriate
class for each of the other scores. To indicate the fifth score in each class
a diagonal line is drawn across the other four. This makes it easier to count
the tallies in each class.
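
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's own procedure: the names SCORES and grouped_distribution are invented here, the quotients in Step 2 are rounded up to whole numbers as the text describes, and a plain run of slashes stands in for the hand-drawn tallies.

```python
import math
from collections import Counter

# The 38 reading readiness scores, in the order given in Table 5.
SCORES = [
    90, 66, 106, 84, 105, 83, 104, 82, 97, 97, 59, 95, 78, 70, 47, 95, 100, 69, 44,
    80, 75, 75, 51, 109, 89, 58, 59, 72, 74, 75, 81, 71, 68, 112, 62, 91, 93, 84,
]


def grouped_distribution(scores, width):
    """Return (lower limit, upper limit, frequency) per class, highest class first."""
    # Step 3: start the lowest class at a multiple of the class interval.
    start = (min(scores) // width) * width
    # Step 4: tally each score in the class in which it falls.
    tallies = Counter((s - start) // width for s in scores)
    top = (max(scores) - start) // width
    return [
        (start + k * width, start + (k + 1) * width - 1, tallies.get(k, 0))
        for k in range(top, -1, -1)
    ]


# Step 1: the range is one more than the highest score minus the lowest.
score_range = max(SCORES) - min(SCORES) + 1
# Step 2: candidate class intervals from the twelve-to-fifteen-class rule.
smallest = math.ceil(score_range / 15)
largest = math.ceil(score_range / 12)
print("range:", score_range, "candidate intervals:", smallest, "or", largest)

table = grouped_distribution(SCORES, width=5)
for low, high, f in table:
    label = f"{low}-{high}"
    print(f"{label:9s}{'/' * f:8s}{f:2d}")
print("N =", sum(f for _, _, f in table))
```

With a class interval of 5 the sketch yields fifteen classes running from 110-114 down to 40-44, with frequencies that total 38.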

TABLE 5

AN ILLUSTRATION OF THE PROCESS OF MAKING A GROUPED FREQUENCY DISTRIBUTION

Original Scores (from Table 3)

90, 66, 106, 84, 105, 83, 104, 82, 97, 97, 59, 95, 78, 70, 47, 95, 100, 69, 44,
80, 75, 75, 51, 109, 89, 58, 59, 72, 74, 75, 81, 71, 68, 112, 62, 91, 93, 84

Steps in Making the Distribution

Step 1. Determining the range.
    Highest Score    112
    Lowest Score      44
    Difference        68
    Range = Difference + 1 = 68 + 1 = 69

Step 2. Selecting the class interval.
    69 ÷ 12 = 5.8, largest class interval desirable.
    69 ÷ 15 = 4.6, smallest class interval desirable.
    (5 chosen because of convenience in tabulation.)

Steps 3 and 4. Determining the limits of the classes and making
the tabulation.

    Whole-Number
    Limits of Classes     Tabulation     Frequency (f)

    110-114               /                   1
    105-109               ///                 3
    100-104               //                  2
     95-99                ////                4
     90-94                ///                 3
     85-89                /                   1
     80-84                ///// /             6
     75-79                ////                4
     70-74                ////                4
     65-69                ///                 3
     60-64                /                   1
     55-59                ///                 3
     50-54                /                   1
     45-49                /                   1
     40-44                /                   1

                                          N = 38


The finished table omits the steps by which it was made. In the simplest
form of a frequency distribution only two columns occur, the first of which
shows the various classes, usually arranged in descending order, and the
second of which shows the frequency or the number of scores in each class.
When two or more schools or grades are to be compared, it is usually best
to include all the data in the same table. In that case there will be a column
for the classes into which the scores are grouped and one for each of the
schools or grades being compared. Table 6 shows a frequency table which
combines the record of six schools on a certain test. The number of grouping
intervals varies from 9 for School F to 17 for Schools A and D.


TABLE 6 


DISTRIBUTION OF READING READINESS SCORES FOR SIX SCHOOLS IN A CERTAIN CITY


Score   School A   School B   School C   School D   School E   School F   All Six Schools
120-124 1 1 

115-119 

110-114 1 1 
105-109 3 2 2 7 
100-104 3 2 2 5 3 15 
95-99 6 4 4 4 5 23 
90-94 5 2 3 5 6 10 31
85-89 4 4 1 4 4 1 18 
80-84 2 3 6 6 4 8 29 
75-79 10 5 4 4 1 2 26 
70-74 6 2 4 7 6 4 29 
65-69 9 4 3 4 1 21 
60-64 4 5 1 2 1 13 
55-59 1 3 1 5 
50-54 1 1 2 
45-49 1 1 2 
40-44 1 2 2 5 
1 2 
2 
1 
1 
38 37 40 36 234 


The form of the table. A few words may be said about the mechanical 
make-up of the table as it occurs in printed or typed form. Each table bears 
a number. Either Roman or Arabic numerals may be employed, but the
latter seem to be increasingly favored. The table number may be centered 
above the title of the table, or it may be given at the beginning of the title. 
The table usually starts with two horizontal lines and ends with a single 
horizontal line. Another horizontal line separates the column headings from 
the body of the table, and other horizontal lines separate any summarizing 
measures which may be given under the table proper. Vertical lines may 
be used to separate the columns, but usually no lines are drawn along the 




margins of the page. It is considered good form to avoid abbreviations in 
the table whenever possible, and to make the title and headings full enough 
to indicate clearly the contents of the table. 

A two-way table, scattergram, or scatter diagram. Table 7 shows
the chronological, educational, and mental ages of the 20 students in a
certain eighth-grade class. It is sometimes helpful to compare at the same
time pupils’ scores on two measures. A two-way table, called a scattergram
or scatter diagram, makes this easier. Table 8 contains a two-way distribution
of mental and educational ages from Table 7.


TABLE 7

THE CHRONOLOGICAL, EDUCATIONAL, AND MENTAL AGES OF THE 20 PUPILS IN AN
EIGHTH-GRADE CLASS

Ages Expressed in Months

Mental ages, grouped into class intervals of six months, make up the
column headings; educational ages, grouped into class intervals of two
months, constitute the rows. For example, Pupil A, with an MA of 208
months and an EA of 188 months, falls in the third column from the right,
or 204-209 class, and in the top row, or 188-189 class. In like manner, the
horizontal position of each pupil in the distribution shows his MA, and
the vertical position shows his EA. A tendency will be observed for the
scores to arrange themselves in a diagonal pattern from lower left to upper
right. This means that, in general, pupils who are low in MA are low in EA,
and pupils who are high in MA are high in EA. However, a few exceptions
stand out. For example, Pupil P, who is lowest in MA, is in the fifth row
from the bottom in EA.


TABLE 8

A TWO-WAY DISTRIBUTION OF MENTAL AGE AND EDUCATIONAL AGE FOR AN
EIGHTH-GRADE CLASS

(r = .65)

Column headings: Mental Age (MA) in Months, in classes 150-155, 156-161,
162-167, 168-173, 174-179, 180-185, 186-191, 192-197, 198-203, 204-209,
210-215, and 216-221, followed by the EA Frequency.
Row headings: Educational Age (EA) in Months, with the MA Frequency in the
bottom row.

When the identification of individual pupils in the scattergram is un- 
important, the totals only are entered in the appropriate squares (the cells). 
For Table 8 all such entries would be 1’s, though in Table 18, page 92, 
numbers in the cells run as high as 4. 
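
How pupils are sorted into the cells of a two-way table can be sketched as follows. This is a hedged illustration only. Pupil A's ages (MA 208, EA 188) come from the text; the pupils labeled X, Y, and Z are hypothetical values invented for the example, and the function name cell and its parameters are likewise assumptions, with MA classes taken to start at 150 and EA classes at even-numbered months.

```python
from collections import Counter

# (MA, EA) in months. Pupil A is from the text; X, Y, and Z are hypothetical.
PUPILS = {"A": (208, 188), "X": (176, 181), "Y": (163, 174), "Z": (177, 180)}


def cell(ma, ea, ma_start=150, ma_width=6, ea_width=2):
    """Return the lower limits of the MA class and the EA class for one pupil."""
    ma_low = ma_start + ((ma - ma_start) // ma_width) * ma_width
    ea_low = (ea // ea_width) * ea_width
    return ma_low, ea_low


# Count how many pupils fall in each cell of the two-way table.
cells = Counter(cell(ma, ea) for ma, ea in PUPILS.values())
for (ma_low, ea_low), count in sorted(cells.items()):
    print(f"MA {ma_low}-{ma_low + 5}, EA {ea_low}-{ea_low + 1}: {count}")
```

Pupil A lands in the 204-209 column and the 188-189 row, as the text states; the two hypothetical pupils X and Z share a single cell, which is how entries larger than 1 arise.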


C. Some Elementary Notions Concerning Quantitative Data 


Concepts versus computations. Two of the most important concepts 
that apply to various kinds of test data are variability and central tendency. 
These abstract notions are useful in summarizing the main features of a 
bewildering mass of figures. There are several commonly used measures
of variability and of central tendency. It is possible to be an adept
computer of these measures without having a clear grasp of their meaning.
Likewise, it is possible to understand the concepts reasonably well without 
being a competent computer, though a purely verbal knowledge of them 
is likely to be unsatisfying and inaccurate. 

In order to read test manuals and other measurement literature effec- 
tively, one needs a real understanding of the concepts of variability and 
central tendency. As an analyzer of test scores and similar material, one 
needs to acquire facility in computing a number of statistics. Compara- 
tively little of this computational ability can be picked up in a short 
measurements course. Perhaps all but the rudiments of calculation are best 
taught in a statistics course or an integrated measurements-statistics se- 
quence. 

Let us defer the computational routines for a few pages in an attempt to 
consider the concepts themselves. An imaginary, informal visit with a 
rookie teacher may help illustrate one aspect of variability and central 
tendency in a concrete situation. If you become curious concerning some
of the statistics reported, look at pages 81-85 for computational
procedures.

Joe's grading dilemma. Joe Doe, a brand-new English teacher, is 
faced with the problem of marking his first batch of English themes. He 
asked the 31 pupils in his tenth-grade class to “write about 500 words on 
your favorite sport." Now the papers are before him, the evening is young, 
and he wonders how to assign grades.

“Well,” says Joe, “These papers differ in lots of ways. Lee Littlesay 
wrote less than half a page, while Eve Effervescent managed to use up 
seven pages. Furthermore, the quality of handwriting varies greatly from 
one student to another, the girls being in general much better than the boys. 
Some of these individuals can’t spell, either, while others can. Hmm, if a 
pupil misspells the same word ten times, does that count off the same as 
misspelling ten different words once each? Some youngsters have some- 
thing to say but say it ungrammatically, and others are uninteresting but 




grammatically correct. It's obvious that there could be at least half a dozen 
different grades for each student. Perhaps I should try to give each paper 
a single ‘overall’ grade first." Let's help him mull through several ways 
to do this. 

One category. The simplest method would be to mark each of the 
31 papers with a check mark (✓), indicating that it is satisfactory. This
would result in the distribution of scores in Table 9. 


TABLE 9

AN EXTREMELY SIMPLE SCORING METHOD

           Number or Frequency,
Score      Abbreviated “f”           %

  ✓               31                100


Two categories. But it is obvious that, though easy on teacher and
pupils alike, this procedure lumps everybody into a single category and 
does not discriminate among them at all. Suppose, instead, Joe uses a check 
mark for “satisfactory” and a minus sign (—) for “unsatisfactory.” Then, 
after grading all the papers, he consolidates the scores into the frequency 
distribution for “Two Categories” shown in Table 10. 


TABLE 10

FIVE CONSECUTIVE EXPLORATORY STEPS IN ASSIGNING “OVERALL” SCORES TO
31 ENGLISH THEMES

One              Two              Three            Four                        Eleven
Category         Categories       Categories       Categories                  Categories

Score   f    %   Score   f    %   Score   f    %   Grade  Score*   f    %      Rank   Grade  Score   f

  ✓    31  100     ✓    26   84     +    12   39     A      3      4   13       1      A+     10     1
                   −     5   16     ±    14   45     B      2      8   26       2      A       9     1
                                    −     5   16     C      1     14   45       3.5    A−      8     2
                                                     F      0      5   16       5.5    B+      7     2
                                                                                8      B       6     3
                                                                               11      B−      5     3
                                                                               14.5    C+      4     4
                                                                               20      C       3     7
                                                                               25      C−      2     3
                                                                               28.5    D       1     4
                                                                               31      F       0     1

Total  31        Total  31        Total  31        Total          31           Total                31

Statistics for the four-category distribution:
Mean = 1.35
Median = 1.25
Mode = 1
Interquartile Range = 0.70 to 2.03 = 1.33
Q = 0.67
SD = 0.90

* Quality points.




Twenty-six themes struck Joe as being “satisfactory,” while he finally
managed to classify 5 as “unsatisfactory.” Thus 5 students out of 31,
16 per cent of the class, failed. The other 84 per cent passed.

Three categories. Do the “satisfactory” themes deserve grades of A,
or B, or C? Joe doesn't know, since his two-category grading system fails
to reveal any differences among the 26 students who earned check marks.
Therefore he rescores these 26 papers, assigning a plus (+) to the
“excellent” themes and a plus-minus (±) to the remaining ones. This gives him
the three-category frequency distribution in Table 10: 12 (39 per cent) of
the students submitted “excellent” themes, 14 (45 per cent) turned in
“satisfactory” ones, and 5 (16 per cent) were “unsatisfactory.”

Four categories. It is easy enough for Joe to decide that the lowest
5 pupils deserve grades of “F” and the middle 14 should get “C's,” but
what about the top 12 persons? Not all of them did “A” work, thinks Joe,
so he goes through the 12 best papers again, dividing them into 4 “A's”
and 8 “B's.” This gives him a new frequency distribution, the four-category
one. Now the percentages are: A, 13 per cent; B, 26 per cent; C, 45 per cent;
and F, 16 per cent. Thank goodness, muses Joe, I've finally spread these
papers out enough to assign the usual grades. The simple check marks I
started with didn’t disperse the pupils at all, because 100 per cent of them
got the same grade. Now less than half of the individuals earn any one
grade.

Variability. Dimly Joe recalls something his measurement teacher said 
about “scatter,” “dispersion,” “heterogeneity,” variability. All we do with 
letter grades of A, B, C, and F is to put each pupil into one of four ordered 
categories, A being the highest and F the lowest. That’s what college
“quality” or “honor” points were for, to indicate the rank or value of each 
category. A grade of A carried 3 quality points, B 2, C 1, and F 0. A bit of 
fancy calculation by Joe reveals that the range of the middle 50 per cent 
of the English class (that is, excluding the highest fourth and the lowest 
fourth) is from 0.70 to 2.03 quality points, a distance of 1.33 quality-point 
units. The type of computational procedure that he used is explained on 
pages 81-85. 
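
A worked sketch of the kind of interpolation that yields these figures follows. It rests on an assumption stated here rather than quoted from the text: each quality-point score k is taken to cover the interval from k − 0.5 to k + 0.5, with the cases spread evenly within the interval. The symbols L (lower limit of the interval containing the point), n_b (number of cases below L), and f (number of cases in the interval) are introduced only for this illustration. The frequencies are those of the four-category distribution: 5 F's, 14 C's, 8 B's, and 4 A's, so N = 31.

```latex
\begin{align*}
P_p &= L + \frac{pN - n_b}{f} \\
Q_1 &= 0.5 + \frac{0.25(31) - 5}{14} = 0.5 + \frac{2.75}{14} \approx 0.70 \\
Q_3 &= 1.5 + \frac{0.75(31) - 19}{8} = 1.5 + \frac{4.25}{8} \approx 2.03 \\
Q_3 - Q_1 &\approx 2.03 - 0.70 = 1.33
\end{align*}
```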

Joe becomes curious about the range of the middle 50 per cent in the 
preceding distributions. Going back to the three-category distribution in 
Table 10 he calls the minus category 0, the plus-minus category 1, and 
the plus category 2. This makes the range of the middle half of the class 
1.16 quality-point units, 0.17 less than when there were four categories. 

For the two-category distribution, where a check mark is 1 and a minus 0,
the middle-half range is only 0.60, 0.56 less than the 1.16 secured for three
categories and less than half the range of 1.33 for four categories. This
pleases Joe, since it seems to confirm his hunch that variability increases
in direct relationship to the number of categories. If this is so, he
continues, why not put each of the 31 individuals into a category by himself;
that is, why not rank-order the papers from 1 (highest) to 31 (lowest),
with no ties? Then the range of the middle 50 per cent would be 15.5.

Eleven categories. The rank-ordering goes along fairly well with the
A and the F papers, but it is troublesome indeed for those who earned C's.
Despite his best efforts, Joe ends up with quite a few ties. The final result
is that he settles for 11 categories rather than 31, which he labels A+, A,
A−, B+, B, B−, C+, C, C−, D, and F, as shown in Table 10. The
middle-50 percentile range of the scores in this frequency distribution is 3.5,
much larger than the 1.33 range for four categories. This represents just 
about all the fineness of grading Joe feels he can obtain for an overall score 
on these themes. 
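
Joe's comparison of the middle-50 per cent range across his grading schemes can be checked with a short Python sketch. It is offered as an illustration, not as the book's computing routine: the function names percentile_point and middle_half_range are invented here, each whole-number score k is assumed to cover the interval k − 0.5 to k + 0.5, and the eleven-category frequencies are those consistent with the counts cited in the text (31 themes, scores summing to 129).

```python
def percentile_point(freqs, fraction):
    """Interpolated score below which `fraction` of the cases fall.

    `freqs` maps whole-number scores to frequencies; each score k is
    treated as covering the interval k - 0.5 to k + 0.5.
    """
    target = fraction * sum(freqs.values())
    below = 0  # cases counted so far, all lying below the current interval
    for score in sorted(freqs):
        f = freqs[score]
        if f and below + f >= target:
            return score - 0.5 + (target - below) / f
        below += f
    raise ValueError("fraction must lie between 0 and 1")


def middle_half_range(freqs):
    """Distance between the points cutting off the lowest and highest quarters."""
    return percentile_point(freqs, 0.75) - percentile_point(freqs, 0.25)


SCHEMES = {
    "two categories": {0: 5, 1: 26},
    "three categories": {0: 5, 1: 14, 2: 12},
    "four categories": {0: 5, 1: 14, 2: 8, 3: 4},
    "eleven categories": {0: 1, 1: 4, 2: 3, 3: 7, 4: 4, 5: 3,
                          6: 3, 7: 2, 8: 2, 9: 1, 10: 1},
    "31 untied ranks": {rank: 1 for rank in range(1, 32)},
}

for name, freqs in SCHEMES.items():
    print(f"{name:18s}{middle_half_range(freqs):6.2f}")
```

Under these assumptions the ranges come out at about 0.60, 1.16, 1.33, 3.5, and 15.5, the figures Joe obtains, growing as the number of categories grows.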

Joe realizes that some English classes will vary more in the quality of 
themes submitted than will others. The college-preparatory class is prob- 
ably less variable with respect to theme-writing ability than is his, a “gen- 
eral” section, so in that group the dispersion of grades would probably be 
less. This depends considerably on the grading philosophy of the individual 
teacher, of course, and on the care with which he goes over the themes.
Some teachers give mainly B's, while others give both A+'s and F's. The
teacher with the grading system most extreme in variation would be one
who gives half the class A+'s and the other half F's, with no B's, C's,
or D's.

Variability of theme grades is dependent upon (1) real differences in 
ability among the students, (2) the nature of the theme and the theme- 
writing situation, and (3) the skill and care with which the themes are 
graded. 

The vocabulary of variability. The middle-50 percentile range, 3.5, 
that Joe computed above is known technically as the interquartile range, 
the range between the boundaries of the upper and lower quarters of the 
group. More commonly, the semi-interquartile range, which is half of the 
middle-50 percentile range (3.5 ÷ 2 = 1.75), is reported. This is called Q,
the quartile deviation, since it is a range within which about a quarter of
the scores lie.

A better measure of variability or dispersion is the range of the middle
80 per cent of the scores, called D, but it has not been used as often as Q.
For the scores in the 11-category distribution, D = 6.9.

The “crude” range, called simply the range, is the whole length of the 
distribution, from the bottom of the lowest score to the top of the highest 
one. Start counting with 0 and go through 10, and you will see readily that 
the range of the 11-category distribution is 11. The range is a relatively
poor measure of variability in most instances, so it is not used very
often.

The best measure of variability for a large variety of situations is the 
standard deviation, abbreviated σ or SD. Twice the SD gives the range
within which approximately the middle two-thirds of all the scores lie. 




For the distribution of scores 0-10 in Table 10, it is 2.5. The greater the 
“range of talent” within a group of persons tested with a particular test, 
the greater the standard deviation of that group will be. A group for whom 
the standard deviation of scores is large is said to be "heterogeneous," 
while one in which all individuals earn about the same scores is “homoge- 
neous." These are relative terms, however. Virtually all groups show some 
scatter on any test. 
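
The standard deviation of the 0-10 scores, and the share of scores lying within one SD of the mean, can be checked with the following sketch. It is illustrative only: the name FREQS is chosen here, the frequencies are those consistent with the counts cited in the text, and the SD is computed by dividing by N rather than N − 1, which is an assumption about the formula intended.

```python
import math

# Scores 0 (F) through 10 (A+) with their frequencies; 31 themes in all.
FREQS = {0: 1, 1: 4, 2: 3, 3: 7, 4: 4, 5: 3, 6: 3, 7: 2, 8: 2, 9: 1, 10: 1}

n = sum(FREQS.values())
mean = sum(score * f for score, f in FREQS.items()) / n
# Standard deviation, dividing the sum of squared deviations by N.
sd = math.sqrt(sum(f * (score - mean) ** 2 for score, f in FREQS.items()) / n)

# Count the scores no farther than one SD from the mean (a span of twice the SD).
within_one_sd = sum(f for score, f in FREQS.items() if sd >= abs(score - mean))

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"{within_one_sd} of {n} scores lie within one SD of the mean")
```

Here 20 of the 31 scores, a little under two-thirds, fall within one SD on either side of the mean, which is roughly what the rule of thumb above leads one to expect.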

You would expect the geography-test scores of pupils in Grades 4, 5,
and 6 combined to be more variable than the scores of Grade 5 alone on the 
same test, because probably a majority of the Grade 4 students know less 
geography than the average fifth grader, while most of the sixth graders 
know more. This would cause the standard deviation of the total group 
to be substantially larger than the SD of scores for any one of the three 
grades. 

Administer a third-grade spelling test to college sophomores and most
of them will get all the words right, so variability of the scores will be 
slight. Give a seventh-grade arithmetic test to third graders and most of 
them will miss almost everything. In each instance the standard deviation 
of the scores will be quite small compared with the SD that would have 
resulted had the test been given to a group for which it was of adequate 
difficulty. 

The concept of central tendency. Because scores on a test vary from 
person to person, some statistic like the standard deviation is needed to 
summarize the extent of this variability without relying solely on intuitive 
study of the whole frequency distribution. Likewise, because the various 
persons tested earn different scores, it is not possible to cite any one figure 
that is completely typical of all individuals. The “modal” score or crude 
mode of the 11-category distribution in Table 10 is 3, since more persons
(7) received it than any other single score. 

The middle-score category is 5, but 19 scores lie below it and only 9 lie 
above. The nearest thing to a really middle score is 4, with 15 frequencies 
below and 12 above. 

Actually, the point that cuts the distribution into halves, with 15.5
frequencies falling above that point and 15.5 falling below, is 3.625. This is
known as the median. It is the statistic with which Q, the quartile deviation
or semi-interquartile range, is usually linked. The range extending a distance
Q on each side of the median [3.625 ± (3.5 ÷ 2) = 1.9 to 5.4] is similar to the
interquartile range. Here the interquartile range is 2.4 to 5.9, with a midpoint
at 4.167, which is 0.54 higher than the median.

When your arithmetic text referred to the average, it meant the sum of
all the scores divided by the total number of scores. This is often called the
arithmetic mean, or simply the mean. Like the mode and the median, it is
a measure of central tendency, indicating the point in the frequency
distribution, usually near the center, toward which the scores tend. For the
11-category distribution the mean is 129 divided by 31, or 4.2. Compare
this with the median of 3.6 and the crude mode of 3.

The mean grade is about one sixth of the distance between C+ and B−. The
median grade is five eighths of the distance between C and C+. The modal grade
is C. The mode is based upon few cases and therefore too untrustworthy 
for use except as a rough, quick estimate of central tendency. The mean 
is the preferred measure, except in certain special cases. 

Mean versus median. Suppose that the D category in the frequency 
distribution had been divided into three parts, with new low scores of —1 
and —2, as shown in Table 11. 


TABLE 11

DIVIDING THE FOUR D'S FROM THE
11-CATEGORY DISTRIBUTION OF
TABLE 10 INTO THREE PARTS

Grade     Score     f

D+          1       2
D           0       1
D−         −1       1
F          −2       1


This does not change the value of the median or the mode, but the new 
mean is 124 divided by 31, which equals 4.0, a decrease of 0.2. 
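
The claim that re-scoring the low grades moves the mean but not the median or the mode can be checked with Python's standard statistics module. This is a sketch under stated assumptions: the names TABLE_10, TABLE_11, and expand are invented here, and median_grouped with an interval of 1 is used as a stand-in for the interpolated median described in this chapter.

```python
import statistics


def expand(freqs):
    """Turn a {score: frequency} table into a flat list of scores."""
    return [score for score, f in freqs.items() for _ in range(f)]


# Eleven-category distribution: scores 0 (F) through 10 (A+).
TABLE_10 = {0: 1, 1: 4, 2: 3, 3: 7, 4: 4, 5: 3, 6: 3, 7: 2, 8: 2, 9: 1, 10: 1}
# Table 11: the four D's become D+ = 1 (two), D = 0, D- = -1; the F becomes -2.
TABLE_11 = {**TABLE_10, 1: 2, 0: 1, -1: 1, -2: 1}

for label, freqs in (("Table 10", TABLE_10), ("Table 11", TABLE_11)):
    scores = expand(freqs)
    print(
        label,
        "mean:", round(statistics.mean(scores), 2),
        "median:", statistics.median_grouped(scores, interval=1),
        "mode:", statistics.mode(scores),
    )
```

Both distributions give a median of 3.625 and a mode of 3, while the mean falls from 129/31 to 124/31.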

If a factory has 100 employees, all but five of whom earn between $2000
and $5000 per year, and if these five executives each earn more than 
$10,000, the mean salary for the factory is likely to be misleadingly high. 
However, the median will not be sensitive to the few high salaries. In fact, 
the median can be ascertained without even knowing the actual salaries 
of the executives by just having a top category of “more than $5000" 
whose frequency is 5. The use of the median instead of the mean may be 
desirable in such instances. Sometimes the better procedure is to exclude 
the five executives from the distribution and to report their salaries sepa- 
rately. Then the 95 workers represent a more homogeneous group for which 
the mean may be used. 

An improbable anecdote should make this clear. Five men once sat together
on a park bench. Two were vagrants, each with total worldly assets
of 25 cents. The third was a workman whose bank account and other assets
totaled $2000. The fourth man had $15,000 in various forms. The fifth
was a multimillionaire with a net worth of $5,000,000. Therefore, the 
modal assets of the group were 25 cents. This figure describes two of 
the persons perfectly, but it is grossly inaccurate for the other three. The 
median figure of $2000 does little justice to anyone except the workman. 
The mean of $1,003,400 is not very satisfactory even for the multimillionaire.
If we had to choose one measure of central tendency, very likely
it would be the mode, which describes 40 per cent of this group accurately. 
But if told that “the modal assets of five persons sitting on a park bench 
are 25 cents,” we would be likely to conclude that the total assets of the 
group are approximately $1.25, which is about $5,000,000 lower than the 
correct figure. Obviously, no measure of central tendency whatsoever is 
adequate for these "strange bedfellows,” who simply do not "tend cen- 
trally." 
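The three averages in the anecdote can be verified with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and it relies on the standard-library statistics module.

```python
from statistics import mean, median, mode

# Total assets, in dollars, of the five men on the bench (figures from the anecdote).
assets = [0.25, 0.25, 2_000.00, 15_000.00, 5_000_000.00]

modal_assets = mode(assets)      # the most frequent value
median_assets = median(assets)   # the middle value when the list is sorted
mean_assets = mean(assets)       # the sum divided by the number of men

print(f"mode   = ${modal_assets:,.2f}")
print(f"median = ${median_assets:,.2f}")
print(f"mean   = ${mean_assets:,.2f}")

# Estimating the group's total from the mode alone misses nearly all of it.
estimate_from_mode = modal_assets * len(assets)
print(f"total estimated from mode = ${estimate_from_mode:,.2f}")
print(f"actual total              = ${sum(assets):,.2f}")
```

The mean comes out a dime above the rounded $1,003,400 quoted in the text, because the two quarters are included in the sum.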


D. Finding the Mode, the Median, and the Mean 


Central tendency. Characteristic of most frequency distributions is a 
tendency for the scores to bunch or concentrate somewhere near the center. 
An important statistic is, therefore, the point on the scale around which 
the scores tend to group themselves. This is a measure of central tendency. 
It is that value which typifies, or best represents, the whole distribution. 

One might wish to know which of several schools made the best record 
on a certain test, and which the poorest. To determine this, compute an 
average for each school, and then note which one has the highest average 
and which one has the lowest average. In other words, that school is best 
which on the average makes the highest score, and that school is poorest 
which on the average makes the poorest score.²

Statisticians employ three common averages. These are the mode, or 
inspectional average; the median, or counting average; and the mean, or 
computed average. The meaning of each of these will now be considered. 

The mode. The most frequent score is called the mode. It is obtained
by inspection. In Table 4 on page 63 the mode is 75, because more pupils 
made that score than any other. The mode is not a very trustworthy aver- 
age, however, especially with small groups. In this case the changing of 
two scores might shift the mode decidedly. If one of the pupils who made 
75 had made 76, and if the one who made 58 had made 59, the mode would 
drop to 59, since more pupils would then have made that score than any 
other. Largely because of its fickleness, the mode is not highly regarded 
as a measure of central tendency for small groups. 

The median. Perhaps the most widely used average in educational 
measurement is the median. The median is the point which divides the
distribution into halves. Sometimes in an ungrouped series the midscore is used
instead of the median. Strictly speaking, when N is an even number, there
is no midscore. In that case, it is customary to average the middle pair of
scores. For example, in Table 4 on page 63 there are 38 pupils, 19 of whom
made scores of 80 or less and 19 of whom made scores of 81 or more. The
midscore is then assumed to be the average of 80 and 81, the middle pair
of scores, which equals 80.5. The terms median and midscore are often used
interchangeably, but the latter should be used only for scores arranged
in order of size rather than in a grouped frequency distribution.

² Obviously, in order to be statistically and educationally significant, the difference
between high and low schools should be fairly large.


TABLE 12
THE PROCESS OF LOCATING THE MEDIAN
(Scores from Table 5)

Frequency Distribution

Class       f
110-114     1
105-109     3
100-104     2
95-99       4
90-94       3
85-89       1
80-84       6
--------------   (18 frequencies lie below this line)
75-79       4
70-74       4
65-69       3
60-64       1
55-59       3
50-54       1
45-49       1
40-44       1

N =        38

Steps in the Process

Step 1. Obtaining N/2.
        N/2 = 38/2 = 19.
Step 2. Locating approximate median.
        1 + 1 + 1 + 3 + 1 + 3 + 4 + 4 = 18.
        This takes us up to 79.5, which is the approximate median.
Step 3. Determining the correction.
        19 − 18 = 1.
        1/6 × 5 = 5/6 = 0.8, the correction.
Step 4. Locating the median.
        79.5 + 0.8 = 80.3, the median. That is, the median is the
        approximate median plus the correction.


Table 12 illustrates the process of locating the median in a frequency 
distribution. The median is often described as the counting average, and 
it will be noted that counting does occupy a prominent place in its location.
The steps may be summarized as follows: 

1. Obtain ½N. That is, divide the total of the frequencies by 2. Here
½N or N/2 = 38/2 = 19.

2. Locate the approximate median. Beginning at the low end of the
distribution, count up the frequency column as far as possible without passing
N/2, obtained in Step 1. In this case the frequencies 1 + 1 + 1 + 3 +
1 + 3 + 4 + 4 give a total of 18. This is as far as we can go, for to include
the next frequency, 6, would carry us too far, or beyond N/2, which is 19.
The approximate median then is 79.5, halfway between the classes that
include scores of 75-79 and 80-84.³


³ ... the 80-84 class. Also, the 6 frequencies in the 80-84 class are considered to be evenly
distributed over the 5-unit interval extending from 79.5-84.5.

3. Determine the correction needed. From N/2 subtract the total obtained
in Step 2. In this case, 19 − 18 = 1. This shows that one more score or
unit is needed to obtain the required ½N scores. And this score must come
out of the next class, the 80-84 class, where there is a frequency of 6. That
is, we must go 1/6 of the distance into the next class. As the class interval
is 5, this means 1/6 of 5, or 0.8. The correction is then 0.8.

4. Obtain the median. This is done by adding the correction to the
approximate median. In this case 79.5 + 0.8 = 80.3, the median.

5. Check by counting down. 84.5 − [(19 − 14)/6 × 5] = 84.5 − 25/6 =
84.5 − 4.2 = 80.3.
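The counting procedure for the Table 12 scores can be sketched in Python as follows. The function and variable names are our own; the frequencies are those of Table 12, listed from the lowest class upward, and the frequencies within a class are assumed to be spread evenly over its interval, as in the text.

```python
WIDTH = 5                                            # class interval
CLASS_LOWER_SCORES = list(range(40, 115, WIDTH))     # 40, 45, ..., 110
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]  # 40-44 up to 110-114


def median_counting_up(lower_scores, freqs, width):
    """Count up from the low end until N/2 frequencies are covered."""
    half = sum(freqs) / 2
    below = 0
    for low, f in zip(lower_scores, freqs):
        if below + f >= half:                 # the median falls in this class
            real_lower = low - 0.5            # e.g. 79.5 for the 80-84 class
            return real_lower + (half - below) / f * width
        below += f


def median_counting_down(lower_scores, freqs, width):
    """Check: count down from the high end until N/2 frequencies are covered."""
    half = sum(freqs) / 2
    above = 0
    for low, f in zip(reversed(lower_scores), reversed(freqs)):
        if above + f >= half:
            real_upper = low + width - 0.5    # e.g. 84.5 for the 80-84 class
            return real_upper - (half - above) / f * width
        above += f


up = median_counting_up(CLASS_LOWER_SCORES, FREQS, WIDTH)
down = median_counting_down(CLASS_LOWER_SCORES, FREQS, WIDTH)
print(round(up, 1), round(down, 1))
```

Both directions arrive at the same point, which rounds to the 80.3 obtained by hand.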

The median for the eleven-category distribution in Table 10 on page 70 
was secured by counting up in the following manner: 


1. Half of 31 is 15.5. 1 + 4 + 3 + 7 = 15, which carries us up to 3.5.
2. 15.5 − 15 = 0.5, the number of frequencies to be gone up into the
class that extends from 3.5-4.5 and has a total frequency of 4.
3. (0.5/4) × 1 = 0.5/4 = 0.1.
4. 3.5 + 0.1 = 3.6, the median.
5. This checks with the result obtained by counting down:
4.5 − (3.5/4 × 1) = 4.5 − 3.5/4 = 4.5 − 0.9 = 3.6.


The median is often used as a reference point for describing the location 
of individual pupils in a distribution. A pupil in the higher half is said to be 
“above the median," and one in the lower half is said to be “below the 
median." Other points in the distribution are used in a similar manner. 
For example, quartiles divide the distribution into fourths, and deciles 
divide it into tenths. A pupil in the highest fourth is said to be “above Q₃,”
and one in the lowest fourth is said to be “below Q₁.”

Quartiles should not be confused with quarters (fourths) of the distribution.
Persons scoring above Q₃ are in the highest fourth of the group but
not in the “highest quartile,” since this expression is meaningless. Likewise,
pupils scoring below Q₁ are in the lowest fourth but not in the “lowest
quartile.” There are only three quartiles, all of which are points rather than
ranges. Q₂ is the median; Q₀ and Q₄ do not exist.

Similarly, there are 9 deciles, going from 1 through 9. A person may score 
at the 2nd decile (the 20th percentile) or between the 2nd and 3rd deciles, 
but not in the 2nd decile. Rather, we would say that he scored in the third 
tenth of the group, counting from the bottom. There are no 0th and 10th
deciles.


Table 14 on page 82 illustrates the computation of the quartiles.


The position of a certain pupil may be still more accurately described 
by indicating the percentage of pupils who fall below him. The points that 
divide a distribution into 100 equal divisions, or per cents, are called per- 
centiles or, more simply, centiles. 

Computation of percentiles. The median is the 50th percentile, since 
50 per cent of all the frequencies lie below that point and 50 per cent lie
above it. A percentile is a point in the score distribution below which the 
stated percentage of all measures lies. Thus an individual who scores at the 
30th percentile of his class has done better than 30 per cent of the students 
and poorer than 70 per cent. Percentiles are computed in much the same 
manner as the median, the only difference being that the number of fre- 
quencies to be counted up depends upon the percentile desired. 

The 30th percentile of the distribution in Table 12 is obtained as follows: 


1. 30% of N = 0.30 × 38 = 11.4. 1 + 1 + 1 + 3 + 1 + 3 = 10, which
carries us up to 69.5.

2. 11.4 − 10 = 1.4, the number of frequencies to be gone up into the
class that extends from 69.5-74.5 and has a total frequency of 4.

3. (1.4/4) × 5 = 7/4 = 1.75.

4. 69.5 + 1.75 = 71.25, the 30th percentile.

5. Check by counting down 100% − 30% = 70% of the way:
   1. 0.70 × 38 = 26.6. 1 + 3 + 2 + 4 + 3 + 1 + 6 + 4 = 24.
   2. 74.5 − [(26.6 − 24)/4 × 5] = 74.5 − (2.6 × 5)/4 = 74.5 − 13/4
      = 74.5 − 3.25 = 71.25.


The 30th percentile in this distribution is 71.25, a point above which 
26.6 (70 per cent) of the scores lie and below which 11.4 (30 per cent) fall. 
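Since every percentile is located by the same counting process, a single routine will serve for all of them. The sketch below is our own illustration, not a standard library function; the name grouped_percentile and its arguments are hypothetical, and the data are the Table 12 frequencies listed from the lowest class upward.

```python
WIDTH = 5                                            # class interval
CLASS_LOWER_SCORES = list(range(40, 115, WIDTH))     # 40, 45, ..., 110
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]  # 40-44 up to 110-114


def grouped_percentile(lower_scores, freqs, width, pct):
    """Point below which pct per cent of the frequencies lie (counting up)."""
    target = pct / 100 * sum(freqs)       # number of frequencies to count up
    below = 0
    for low, f in zip(lower_scores, freqs):
        if f > 0 and below + f >= target:
            real_lower = low - 0.5        # real lower limit of this class
            return real_lower + (target - below) / f * width
        below += f
    raise ValueError("pct must not exceed 100")


p30 = grouped_percentile(CLASS_LOWER_SCORES, FREQS, WIDTH, 30)
p50 = grouped_percentile(CLASS_LOWER_SCORES, FREQS, WIDTH, 50)
print(round(p30, 2), round(p50, 1))
```

With pct set to 50 the routine reproduces the median, which is simply the 50th percentile.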

At first it may be difficult for you to think of the 50th percentile as 
"average" because of the minimum passing percentage mark of 70 or 75 
that is quite common in high schools. Of course, there is no such thing as a 
“failing” percentile. The decision to pass or fail a given student is an arbi- 
trary one. For roughly descriptive purposes, however, the 25th and 75th 
percentiles are sometimes used as boundary lines; scores in the lowest 
fourth of a group are said to be “low,” while those in the highest fourth 
are called “high.” 

The mean. The most familiar average is the mean, often called the 
arithmetic mean. In fact, this measure is in such common use that the ordi- 
nary person regards it as the average, because it is the only average he 
knows anything about. When the term "average" is met with in ordinary
conversation or the newspaper in such statements as "average temperature,"
"average rainfall," "average yield of corn and wheat," "average
price," and the like, it is almost certain to be the mean that is meant. The
mean can be computed merely by obtaining the sum of the measures and 
dividing by their number. The measure so obtained is then the value that 
each individual would have if all shared equally. 

When the scores are few in number or in an ungrouped series, the simplest 
process of computing the mean is the one described above; that is, the 
scores are first added and then this sum is divided by the number of scores. 
For example, the sum of the 38 scores in Table 3 on page 62 is 3,050, and 
3,050 ÷ 38 = 80.3. When the scores are sufficiently numerous to justify
the use of a frequency distribution, the so-called “short” method of com- 
puting the mean may be more convenient: 


$$M = M' + \left(\frac{\Sigma fd}{N}\right) i$$

In this formula
M = mean,
M' = assumed mean (the midpoint of any class),
Σfd = sum of frequencies multiplied by their respective deviations (Σ indicates
“sum of”),
N = total number of frequencies,
i = class interval.


The method of computing the mean by the short formula is illustrated 
in Table 13. The steps in the process are as follows: 


Step 1. Assume a mean. This is taken at the midpoint of some class. Any
class may be selected, even though completely outside the frequency
distribution, but choosing one near the center of the distribution
makes the figures to be handled much smaller. In this case,
the assumed mean is taken at the midpoint of the 80-84 class,
which is (80 + 84)/2 = 82.

Step 2. Lay off the deviations from the assumed mean. The plus deviations
indicate how many classes various frequencies are above the assumed
mean, and minus deviations indicate how many classes
various frequencies are below the assumed mean. This column is
headed d.

Step 3. Multiply each f by its corresponding d. This column is headed f × d.
The first product is 1 × 6 = 6, the second is 3 × 5 = 15, and so on.

Step 4. Obtain the algebraic sum of the f × d column. The sum of
the + values is 48 and the sum of the − values is −61. The
algebraic sum of −61 and 48 is −13. Had the + values exceeded
the − values, the sum would have been +. This is called Σfd.


TABLE 13
A SHORT WAY TO COMPUTE THE MEAN

Class       f      d     f × d
110-114     1     +6      +6
105-109     3     +5     +15
100-104     2     +4      +8
95-99       4     +3     +12
90-94       3     +2      +6
85-89       1     +1      +1
==============================
80-84       6      0       0
==============================
75-79       4     −1      −4
70-74       4     −2      −8
65-69       3     −3      −9
60-64       1     −4      −4
55-59       3     −5     −15
50-54       1     −6      −6
45-49       1     −7      −7
40-44       1     −8      −8

N =        38            +48
                         −61
                  Σfd =  −13

M = M' + (Σfd/N) × i = 82 + (−13/38) × 5 = 82 − 1.7 = 80.3

Steps in the Process

Step 1. Assuming a mean. 82 is taken as the assumed mean. Two parallel lines
        indicate the class where it is the midpoint.
Step 2. Laying off deviations from assumed mean. This is the column headed d.
Step 3. Multiplying each f by its d. This column is headed f × d.
Step 4. Obtaining algebraic sum of the f × d column. Sum of + values is 48.
        Sum of − values is −61. Algebraic sum is −13. This is Σfd.
Step 5. Substituting proper values in the formula, as indicated beneath the table.


Step 5. Substitute in the formula M = M' + (Σfd/N) × i. M' was found in
Step 1 to be 82. The class interval, i, is 5, since there are five whole
numbers in each class: 40-44 means 40, 41, 42, 43, and 44. Σfd,
found in Step 4, equals −13. N = 38, the total number of scores.

M = 82 + (−13/38) × 5 = 82 − 1.7 = 80.3.


It will be recalled that the mean when computed by the “long” method
was 80.3, exactly the same as when computed by the “short” method.
In other similar problems a discrepancy may occur due to the fact that the
former method is based upon the actual value of the scores, whereas the
latter method is based on the assumption that the midpoint of each class
is the average for all the scores in that class, an assumption which is often
only approximately true. The difference in the result is usually so slight 
as to be of no practical importance. 
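The short method is easily expressed in Python. The following is a sketch under our own naming (short_method_mean is a hypothetical helper); it uses the class midpoints and frequencies of Table 13, listed from the lowest class upward, and tries two different assumed means to show that the choice does not affect the result.

```python
WIDTH = 5                                   # class interval, i
MIDPOINTS = list(range(42, 113, WIDTH))     # 42 (40-44), 47, ..., 112 (110-114)
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]


def short_method_mean(midpoints, freqs, width, assumed_mean):
    """Return (M, sum of f x d) using M = M' + (sum fd / N) x i."""
    n = sum(freqs)
    # d: how many classes each midpoint lies above (+) or below (-) M'
    devs = [round((m - assumed_mean) / width) for m in midpoints]
    sum_fd = sum(f * d for f, d in zip(freqs, devs))
    return assumed_mean + (sum_fd / n) * width, sum_fd


mean_82, sum_fd_82 = short_method_mean(MIDPOINTS, FREQS, WIDTH, 82)
mean_62, sum_fd_62 = short_method_mean(MIDPOINTS, FREQS, WIDTH, 62)

# The same value results from treating every score as its class midpoint.
direct = sum(f * m for f, m in zip(FREQS, MIDPOINTS)) / sum(FREQS)

print(sum_fd_82, round(mean_82, 1), round(mean_62, 1), round(direct, 1))
```

The midpoints yield a sum of 3,051 against the 3,050 reported for the raw scores, a difference too small to survive rounding to one decimal place.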

What average is best? As a rule, the mean is regarded as the best measure 
of central tendency and the mode the poorest. The mean, however, is greatly 
influenced by extreme scores, and whenever it is desired to avoid this in- 
fluence, the median is to be preferred. As such situations often arise in 
educational measurement, the median is widely used. For example, if the 
test is too difficult, there may be several zero scores; and if the test is too 
easy, there may be several perfect scores. But in neither case are the pupils 
at the extremes correctly measured. The median in some such situations is 
the best average to use. The median is also easier to find than the mean, 
unless an electric calculator is available. 


E. Measures of Variability or Scatter 


Meaning of variability. No distribution is completely described by its 
average or central tendency. Two classes in a school might have the same 
average intelligence and yet be very unlike. The members of one class may 
vary all the way from feeble-mindedness to the genius level, while all the 
members of the other group may rate as normal. Obviously, these two 
classes present different instructional problems because they differ in vari- 
ability. Variability is the extent to which the scores tend to scatter or spread 
above and below the average. It is clearly important to have some con- 
venient method of indicating the variability of a group. There are three 
common measures of variability: the range, the quartile deviation, and the 
standard deviation. All these measures represent distances rather than 
points, and the larger they are the greater the variability or scatter of the 
scores.

The range. The range has already been referred to as the distance between
the lowest and the highest scores plus one.⁴ It is usually a very
untrustworthy measure of variability.⁵ The shift in a single score may
greatly alter the range and thereby materially increase or reduce the
apparent variability of the group. School A and School D in Table 6 on
page 66 illustrate this possibility.

⁴ More precisely, the range is the difference between the lower real limit of the
lowest class and the higher real limit of the highest class.

⁵ When there are only two measures (N = 2), however, the range gives all the
information concerning variability that the distribution can yield.

The quartile deviation. A measure of variability that avoids being
unduly influenced by extreme scores is the quartile deviation, or Q. This is
one half the distance between the first and third quartiles. For this reason,
it is often referred to as the semi-interquartile range. Since 25 per cent of
the scores fall below the first quartile, or Q₁, and 25 per cent of the scores
exceed the third quartile, or Q₃, the interquartile range is the range of the
middle 50 per cent of the scores. The whole interquartile range might be
used to express the variability of the group, but it is customary to take
only half this distance and to set up a new middle-half range extending
from Q below the median to Q above the median: Mdn ± Q. As already
noted, the middle of the interquartile range will not usually be the median,
while the middle of this new range will always be. On the other hand,
exactly half of all the frequencies lie within the interquartile range and
exactly half outside it, but this does not hold precisely for Mdn ± Q.
The formula used for obtaining Q is 


$$Q = \frac{\text{75th percentile} - \text{25th percentile}}{2} = \frac{Q_3 - Q_1}{2}$$


TABLE 14

THE PROCESS OF COMPUTING THE QUARTILE DEVIATION, Q

Frequency Distribution

Class       f
110-114     1
105-109     3
100-104     2
95-99       4
--------------   (28 frequencies lie below this line)
90-94       3
85-89       1
80-84       6
75-79       4
70-74       4
65-69       3
--------------   (7 frequencies lie below this line)
60-64       1
55-59       3
50-54       1
45-49       1
40-44       1

N =        38

Steps in the Process

Step 1. Computing Q₁, the 25th percentile.
        ¼N = ¼ of 38 = 9.5.
        Counting up: 1 + 1 + 1 + 3 + 1 = 7; approximate Q₁ is 64.5.
        9.5 − 7 = 2.5; (2.5 × 5)/3 = 12.5/3 = 4.17, the correction.
        64.5 + 4.17 = 68.67, Q₁.
Step 2. Computing Q₃, the 75th percentile.
        ¾N = ¾ of 38 = 28.5.
        Counting up: 1 + 1 + 1 + 3 + 1 + 3 + 4 + 4 + 6 + 1 + 3 = 28;
        approximate Q₃ is 94.5.
        28.5 − 28 = 0.5; (0.5 × 5)/4 = 2.5/4 = 0.62.
        94.5 + 0.62 = 95.12, Q₃.
Step 3. Substituting in formula.
        Formula: Q = (Q₃ − Q₁)/2.
        Substituting: Q = (95.12 − 68.67)/2 = 26.45/2 = 13.2.


Table 14 illustrates the computation of Q. It will be observed that the 
process of locating quartiles is like that of locating any other percentile. 
In the first step, the fractional part of N indicates the proportion of the 
distribution which falls below the desired point; that is, for Q₁ it is ¼N and
for Q₃ it is ¾N. There are four steps, as follows:




Step 1. Compute Q₁, the 25th percentile. To begin with, ¼ of 38 is 9.5.
The next three steps in locating this point are exactly the same as
those in locating any percentile.

Step 2. Compute Q₃, the 75th percentile. Here the first step is to take ¾N;
¾ of 38 is 28.5. The other three steps are identical with those in
locating any percentile.

Step 3. Substitute in the formula. Q₃ is 95.12 and Q₁ is 68.67. The difference
between them is 26.45. Half of this difference is 13.2, the value of Q.

Step 4. Check your work by counting downward. 


The interpretation of Q and other measures of variability is a relative 
matter. Whether a Q of 13.2 is to be considered great or small depends upon 
the magnitude of comparable measures for other groups using the same test. 
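The computation of Q for these scores may be sketched in Python as below. The helper name percentile_point is our own invention, and the data are the Table 14 frequencies listed from the lowest class upward; the sketch simply locates the 25th and 75th percentiles by counting up and halves their difference.

```python
WIDTH = 5                                            # class interval
CLASS_LOWER_SCORES = list(range(40, 115, WIDTH))     # 40, 45, ..., 110
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]  # 40-44 up to 110-114


def percentile_point(pct):
    """Locate a percentile of the Table 14 distribution by counting up."""
    target = pct / 100 * sum(FREQS)
    below = 0
    for low, f in zip(CLASS_LOWER_SCORES, FREQS):
        if f > 0 and below + f >= target:
            return (low - 0.5) + (target - below) / f * WIDTH
        below += f
    raise ValueError("pct must not exceed 100")


q1 = percentile_point(25)          # first quartile
q3 = percentile_point(75)          # third quartile
mdn = percentile_point(50)         # median
q = (q3 - q1) / 2                  # quartile deviation

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, Q = {q:.1f}")
print(f"Mdn - Q = {mdn - q:.1f}, Mdn + Q = {mdn + q:.1f}")
```

The last line prints the limits of the range Mdn ± Q, which is centered on the median but, unlike the interquartile range, need not contain exactly half the frequencies.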

The standard deviation. A third measure of variability, which has 
many uses in educational measurement, is the standard deviation. It is 
usually represented by the Greek letter σ, called sigma, and defined as the
square root of the mean of the squares of the deviations of the scores from
their mean. It may also be defined as that range above and below the mean
(M ± 1σ) that in a normal distribution⁶ includes 68.26 per cent, or
approximately two thirds, of the scores.

The formula for the standard deviation when computed from an as- 
sumed mean for scores in a frequency distribution is 


$$\sigma = \frac{i\sqrt{N\,\Sigma fd^{2} - (\Sigma fd)^{2}}}{N}$$

The computational process is illustrated in Table 15. It can be seen that 
the only term not used in the computation of the mean in Table 13 is 
Σfd², the sum of each frequency times the square of its respective deviation.
The steps needed for computing the standard deviation are as
follows:


Step 1. Assume that the mean falls at the midpoint of a certain class, say 
80-84. 

Step 2. Lay off the deviations above and below the assumed mean. 

Step 3. Multiply each f by its d.

Step 4. Obtain Σfd, the algebraic sum of the fd column. Here it is −13.

Step 5. Prepare the d × (f × d) column. Each entry in this column is the
product of a d and the f × d to its right.

Step 6. Obtain Σfd², the sum of the d × (f × d) column. All values in the
d × (f × d) column are positive, since negative deviations are
squared. Their sum is 479.

Step 7. Substitute in the formula for σ.

⁶ This particular type of frequency distribution is discussed on pages 260 and 290.


TABLE 15

A SIMPLIFIED WAY TO COMPUTE THE STANDARD DEVIATION

Computation

Class       f      d     f × d    d × (f × d)
110-114     1     +6      +6         36
105-109     3     +5     +15         75
100-104     2     +4      +8         32
95-99       4     +3     +12         36
90-94       3     +2      +6         12
85-89       1     +1      +1          1
=============================================
80-84       6      0       0          0
=============================================
75-79       4     −1      −4          4
70-74       4     −2      −8         16
65-69       3     −3      −9         27
60-64       1     −4      −4         16
55-59       3     −5     −15         75
50-54       1     −6      −6         36
45-49       1     −7      −7         49
40-44       1     −8      −8         64

N =        38            +48        479 = Σfd²
                         −61
                  Σfd =  −13

σ = i√(NΣfd² − (Σfd)²)/N = 5√(38(479) − (−13)²)/38 = 5√(18202 − 169)/38
  = 5√18033/38 = 5(134.3)/38 = 671.5/38 = 17.7

Steps in the Process

Step 1. Assume a class for the mean (80-84).
Step 2. Lay off deviations from assumed mean.
Step 3. Multiply each f by its d.
Step 4. Obtain Σfd, the algebraic sum of the f × d column. Here it is −13.
Step 5. Prepare the d × (f × d) column. Each entry is the product of d and
        the f × d opposite it.
Step 6. Obtain Σfd². This is merely the sum of the d × (f × d) column. Here it
        is 479.
Step 7. Substitute in the formula, as indicated.
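The computation in Table 15 can be reproduced in Python. The sketch below uses names of our own choosing and the frequencies of Table 15 listed from the lowest class upward; as a cross-check, it also applies the definition of σ directly to the class midpoints by means of the standard-library function statistics.pstdev.

```python
from math import sqrt
from statistics import pstdev

WIDTH = 5                                   # class interval, i
ASSUMED_MEAN = 82                           # midpoint of the 80-84 class
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]  # 40-44 up to 110-114
DEVS = list(range(-8, 7))                   # d for each class: -8, -7, ..., +6

n = sum(FREQS)
sum_fd = sum(f * d for f, d in zip(FREQS, DEVS))        # algebraic sum of f x d
sum_fd2 = sum(f * d * d for f, d in zip(FREQS, DEVS))   # sum of d x (f x d)

# Assumed-mean formula: sigma = i * sqrt(N * sum fd^2 - (sum fd)^2) / N
sigma = WIDTH * sqrt(n * sum_fd2 - sum_fd ** 2) / n

# Cross-check from the definition, treating each score as its class midpoint.
midpoints = [ASSUMED_MEAN + WIDTH * d for d in DEVS]
expanded = [m for m, f in zip(midpoints, FREQS) for _ in range(f)]
sigma_check = pstdev(expanded)

print(n, sum_fd, sum_fd2, round(sigma, 1), round(sigma_check, 1))
```

The two routes agree because the assumed-mean formula is only an algebraic rearrangement of the definition; the assumed mean itself drops out of the result.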


D, a useful percentile measure of variability. σ may be estimated
fairly accurately and easily by means of the formula

σ = 0.4 × D = 0.4 × (90th percentile − 10th percentile).

For the scores in Table 15, 0.4D would be secured as follows:

Step 1. Obtain the 90th percentile, which is a point that lies 10 per cent of
the way down into the score distribution. 10 per cent of 38 = 3.8, so
the 90th percentile = 109.5 − [(3.8 − 1)/3 × 5] = 109.5 − (2.8 × 5)/3
= 109.5 − 14.0/3 = 109.5 − 4.67 = 104.83.

Step 2. Obtain the 10th percentile, which is a point that lies 10 per cent
of the way up into the score distribution. The 10th percentile =
54.5 + [(3.8 − 3)/3 × 5] = 54.5 + (0.8 × 5)/3 = 54.5 + 4/3
= 54.5 + 1.33 = 55.83.

Step 3. Substitute in the formula σ = 0.4D. σ = 0.4 × (104.83 − 55.83) =
0.4 × 49 = 19.6 ≈ 20. Contrast this with the 17.7 value of σ in
Table 15, which rounds off to 18.
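The estimate 0.4D may be checked in Python along the following lines. This is a sketch with hypothetical names; it locates both percentiles by counting up, which gives the same points as counting down from the top, and it takes the value 17.7 for σ from Table 15 rather than recomputing it.

```python
WIDTH = 5                                            # class interval
CLASS_LOWER_SCORES = list(range(40, 115, WIDTH))     # 40, 45, ..., 110
FREQS = [1, 1, 1, 3, 1, 3, 4, 4, 6, 1, 3, 4, 2, 3, 1]  # 40-44 up to 110-114
SIGMA_TABLE_15 = 17.7                                # value quoted from Table 15


def percentile_point(pct):
    """Locate a percentile of the grouped distribution by counting up."""
    target = pct / 100 * sum(FREQS)
    below = 0
    for low, f in zip(CLASS_LOWER_SCORES, FREQS):
        if f > 0 and below + f >= target:
            return (low - 0.5) + (target - below) / f * WIDTH
        below += f
    raise ValueError("pct must not exceed 100")


p90 = percentile_point(90)
p10 = percentile_point(10)
d_range = p90 - p10                 # D, the 10-to-90 percentile range
sigma_estimate = 0.4 * d_range      # quick estimate of sigma

print(f"P90 = {p90:.2f}, P10 = {p10:.2f}, D = {d_range:.1f}")
print(f"0.4D = {sigma_estimate:.1f} against sigma = {SIGMA_TABLE_15}")
```

For this small group the estimate runs about two points above the computed σ, which gives some idea of how rough the shortcut can be when N is only 38.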


Though Q is used much more frequently than D, the latter is a far better
percentile measure of variability and is considerably easier to compute. 

Practical uses of the standard deviation. The standard deviation is 
the most important measure of the variability of test scores. A small standard 
deviation means that the group has small variability, or is relatively homo- 
geneous, while a large standard deviation means the opposite condition, 
heterogeneity. σ also has certain other important uses.

The position of a pupil in a distribution is often represented in terms of 
standard deviation units. In the distribution used in Tables 13 and 15, 
where the mean is 80 and the standard deviation is 18, a pupil whose score 
is 98 is said to be one standard deviation above the mean, and the score is
written +1σ. In like manner, a pupil whose score is 62 is said to be
approximately one standard deviation below the mean, and the score is written
−1σ. Such scores are called standard scores or z-scores.⁷
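Converting a raw score to a standard score is a matter of one subtraction and one division. The small Python sketch below uses a helper name of our own (z_score) and the rounded values of 80 for the mean and 18 for the standard deviation given in the text.

```python
MEAN = 80    # rounded mean of the distribution in Tables 13 and 15
SD = 18      # rounded standard deviation from Table 15


def z_score(score, mean=MEAN, sd=SD):
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd


for raw in (98, 80, 62):
    print(f"score {raw}: z = {z_score(raw):+.1f}")
```

A score equal to the mean has a z-score of zero, scores above the mean are positive, and scores below it are negative.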

Which measure of variability is best? As a rule, σ is regarded as the best
measure of variability, and the range is undoubtedly the poorest. The
range is subject to all the limitations which the mode has as a measure of
central tendency. Just as the mean is greatly influenced by extreme scores,
so is σ. Whenever it is desirable, therefore, to avoid the influence of extreme
scores, the median is employed as a measure of central tendency, and with
it a percentile measure of variability such as Q or D. In like manner, when
the mean is used, σ is the appropriate measure of variability.


F. Measures of Relationship 


The concept of co-relationship or concomitant variation. During
the latter part of the nineteenth century, Sir Francis Galton and the pioneer
statistician Karl Pearson succeeded in developing the theory and
mathematical basis for what is now known as correlation.⁸,⁹ They were
concerned with relationships between two variates, for example, height and
weight. It is easy to note that tall persons usually weigh more than short
ones, suggesting that above-average height tends to go with above-average
weight. Height and weight vary together, though certainly not perfectly;
there are “beanpoles” and “five by fives” to upset the relationship. It
would be possible to select a group of individuals in such a manner that
the taller the person is the less he weighs, but this negative relationship
between height and weight is not to be expected for individuals picked at
random.

⁷ For a fuller discussion of z-scores, see page 289.

⁸ A fascinating account is contained in Helen M. Walker, Studies in the History of
Statistical Method, Chapter V. Baltimore: The Williams & Wilkins Company, 1929.

⁹ The concepts of correlation and regression are treated together in most statistics texts,
but for the sake of simplicity, regression is not mentioned in the present discussion.


Let us examine some other factors which normally vary together. There 
is a substantial, but again by no means perfect, positive correlation be- 
tween intelligence test scores and average grades earned during the fresh- 
man year of college. The higher the score obtained by the entering fresh- 
man, the higher his grades are likely to be. The lower an individual's score, 
the poorer a student he will probably make. This relationship has been 
found with all sorts of intelligence tests used in a great number of different 
type colleges ever since such tests first became available commercially 
shortly after the close of World War I.¹⁰

Husbands and wives tend to be more like each other with respect to age, 
amount of education, and many other factors, than they are like people 
in general. The sons of tall fathers tend to be taller than average, and the 
sons of short fathers tend to be short. Likewise, the fathers of tall sons tend 
to be of above average height. Children resemble their own parents in in- 
telligence more closely than they resemble other adults. Positive correla- 
tion between members of families is usually found for almost any charac- 
teristic from algebra ability to knowledge of zoology. 

Using Galton’s ideas concerning trait resemblances, Pearson devised as 
a measure of relationship the product-moment coefficient of correlation, r. 
Since about 1900 this has been a widely employed statistic. In the testing 
field it has become almost indispensable. Virtually all test manuals are 
plentifully sprinkled with r's, as is most educational literature. The class- 
room teacher frequently encounters r's in his reading and conversation. 
For these reasons, both the concept and the computational procedure are 
well worth mastering. 

Pearson's original r and several other related r’s summarize the magni- 
tude and direction of the relationship between two sets of measurements, 
such as height and weight based upon the same persons, or between meas- 
urements on pairs of persons, like the fathers and sons mentioned above. 
It makes no difference whether the variates are history grades and geogra- 
phy grades, or speed of running the hundred-yard dash and skill in playing 
the violin, or speed of tapping and age. In every situation, r can have 
values that range from −1 for a perfect inverse relationship through 0 for
no systematic correlation to +1 for perfect direct relationship, and the r's
between radically different kinds of variates are wholly comparable. For
example, it is meaningful to say that reading ability and intelligence are 
more closely related than height and weight. 
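The behavior of r at its limits can be illustrated with a short Python sketch. The function below is our own direct rendering of the product-moment definition (the sum of the products of paired deviations, divided by the square root of the product of the two sums of squared deviations); the heights and weights are hypothetical figures invented for illustration, not data from this book.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation for paired measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]            # deviations from the mean of x
    dy = [y - mean_y for y in ys]            # deviations from the mean of y
    sum_products = sum(a * b for a, b in zip(dx, dy))
    return sum_products / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical heights (inches) and weights (pounds) of six persons.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 120, 150, 140, 165, 180]

r_height_weight = pearson_r(heights, weights)
r_perfect_direct = pearson_r(heights, heights)
r_perfect_inverse = pearson_r(heights, [-h for h in heights])

print(round(r_height_weight, 2), round(r_perfect_direct, 2), round(r_perfect_inverse, 2))
```

Pairing a variate with itself gives the upper limit of +1, pairing it with its own negative gives the lower limit of −1, and the hypothetical height and weight figures fall between the two.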

In order to compute r without an electric calculator, it is helpful to have
a scatter diagram such as Table 8 on page 68. However, those scores are
too simple to illustrate computational procedures well, since not more than
a single individual falls into any one cell of the scattergram. The r between
MA and EA for Table 8 is .65, which indicates a moderate positive
relationship.

¹⁰ One of the most complete summaries is Harley F. Garrett, “A Review and
Interpretation of Investigations of Factors Related to Scholastic Success in Colleges of Arts
and Sciences and Teachers Colleges,” Journal of Experimental Education, 18: 91-138,
December, 1949.


TABLE 16

A SCATTER DIAGRAM ILLUSTRATING NEGATIVE CORRELATION (r = −.35) BETWEEN
CHRONOLOGICAL AGE AND EDUCATIONAL AGE FOR 20 EIGHTH GRADERS

(Data from Table 7)

[Scatter diagram. Columns: Chronological Age (CA) in months, in 3-month classes from 141-143 to 177-179. Rows: Educational Age (EA), in 2-month classes from 164-165 to 188-189. The margins give the CA and EA frequencies.]

Negative correlation is illustrated in Table 16, using the 20 chronological-age
and educational-age scores from Table 7, page 000. Since the older
children in a school grade are usually less able students than the younger
ones, pupils with high CA's tend to have low EA's, and those with low CA's
tend to have high EA's. That this relationship in Table 16 is rather weak
is indicated by the low magnitude of the r, which is −.35. The oldest pupil
has a CA of 179 and an EA of 171, while the youngest pupil's CA is 141
and his EA 183. However, one of the younger pupils secured an EA of
only 167.


Two urgent cautions are imperative. First, r cannot be interpreted directly
as a percentage. An r of 0 represents no relationship at all, but an r of .65
does not mean 65 per cent relationship. In one sense, the difference in
degree of relationship represented by r's of .91 and .98 is as great as the
difference between r's of 0 and .65. As r's get larger, a small gain indicates
a considerable increase in the degree of correlation. Therefore, an r of .66
indicates more than twice the relationship shown by an r of .33. These
relative magnitudes are depicted graphically in Figure 4,¹¹ where the curve
for negative r's is symmetrical with the one for positive r's. Thus a given r
indicates a high degree of relationship if its magnitude, regardless of sign,
is large. An r of −.72 denotes just as strong an inverse relationship as an r
of .72 indicates direct covariation; both have equal predictive value. Note
that the two curves of Figure 4 are approximately linear through about
the first fourth of the r scale, after which they become increasingly
positively accelerated. An r of .24 corresponds to a z of .24, while an r of .995
corresponds to a z of 3.00.
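
The comparisons in the paragraph above can be checked numerically. The sketch below assumes that the z scale of Figure 4 is Fisher's transformation, z = arctanh(r), as the footnote to the figure indicates; the function name `fisher_z` is an illustrative choice, not something taken from the text.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's z-transformation of a correlation coefficient (requires |r| < 1)."""
    return math.atanh(r)


# Tabulate z for the r values mentioned in the text.
for r in (0.24, 0.33, 0.65, 0.66, 0.91, 0.98, 0.995):
    print(f"r = {r:.3f}   z = {fisher_z(r):.2f}")

# The gap from .91 to .98 on the z scale versus the gap from 0 to .65.
gap_high = fisher_z(0.98) - fisher_z(0.91)
gap_low = fisher_z(0.65) - fisher_z(0.0)
print(f"z gap for .91 to .98: {gap_high:.2f}")
print(f"z gap for 0 to .65:   {gap_low:.2f}")
```

On this scale the two gaps come out nearly equal, and the z for .66 is more than double the z for .33, which is the sense in which the text compares degrees of relationship.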

The second warning is that correlation does not necessarily mean causation. 
Often variables other than the two under consideration are responsible for 
the association. Furthermore, problems in the social sciences, the field in 
which correlation is most used, are usually too complex to be explained 
in terms of a single cause. 

Let us take several examples. It is probably true that in the United 
States there is moderate positive correlation between the average salaries 
of teachers in various high schools and the percentages of their graduates 
who go on to college, but to say that these students attend college because 
their teachers are well paid is as inaccurate as to say that their teachers 
are well paid because many of the graduates attend college. The situation 
is complex, but one prominent factor is the financial condition of the com- 
munity, which to a considerable extent determines ability to pay both 
teachers’ salaries and college expenses. 

Furthermore, it has been found that the percentage of “dropouts” oc- 
curring in high schools varies inversely with the number of books per pupil 
in the libraries of those schools.¹² But common sense tells us that piling
more books into the library will hardly affect the dropout rate, nor will 
getting a better attendance officer bring about a magical increase in the 
number of books. 

Failure to recognize the non-causal nature of correlation is, in its broadest 
sense, a widespread logical error, for the fundamental notions of co-rela- 


11 Based upon Ronald A. Fisher, Statistical Methods for Research Workers (Tenth
Edition), Table V.B., page 210. London: Oliver and Boyd, 1948. The vertical (ordinate)
figures, ranging from 0 to 3.0, are Fisher's z-transformation values of the r scale. They
are not the same as the z-scores mentioned on page 85.

12 For several such relationships, see Guy V. Ferrell, High School Holding Power:
An Analysis of Certain Internal Factors, Unpublished Ph.D. Dissertation, George Pea-
body College for Teachers, 1951. 228 pages.


Figure 4. The Relative Amount of Relationship Represented by r's of Various Sizes.
(Horizontal axis: r, from −1.00 to 1.00; vertical axis: Fisher's z scale.)




tionship affect our lives at many points. Going to Sunday School is gen- 
erally agreed to be valuable from many standpoints, but a positive rela- 
tionship between the rate of attendance and a characteristic such as 
honesty does not necessarily imply that children are honest because they 
attend Sunday School. Underlying both attendance and honesty may be 
“home training," for example. A really crucial test of the hypothesis that 
attending Sunday School “makes” children more honest would have to be 
experimental rather than correlational. 

Note carefully that while correlation does not directly establish a 
“causal” relationship, it may furnish clues to causes—and these can be 
taken advantage of in planning controlled experimentation. Therefore, r is 
useful primarily for exploratory purposes. Hence, it is employed much more 
widely in the newer sciences such as sociology, psychology, and education 
than in physics and chemistry. 

Various ways to obtain r. There are many devices for computing r. 
One of the simplest methods requires an electric calculator and utilizes the 
"raw" scores themselves, without any frequency distributions or scatter- 
grams. However, it is seldom feasible unless a calculator is available, be- 
cause the arithmetical operations involve large numbers and therefore 
become quite tedious. Almost all other computational procedures require 
a scattergram. Many routinized “correlation charts” are available com- 
mercially, and nearly every statistics teacher has his own pet version, 
usually to some extent original. 
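
A minimal sketch of the raw-score approach mentioned above follows. The text does not print the formula at this point, so the standard raw-score form of the product-moment r is assumed here; the function name `pearson_r` is illustrative. The data are the 20 EA and MA pairs listed in Table 21 of this chapter.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed from raw paired scores, using only the sums of
    scores, squared scores, and cross products."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Educational ages and mental ages of the 20 pupils, as listed in Table 21.
ea = [188, 186, 185, 183, 183, 182, 181, 180, 179, 176,
      176, 176, 176, 175, 174, 173, 171, 167, 167, 165]
ma = [208, 218, 201, 185, 165, 191, 185, 193, 181, 165,
      187, 176, 180, 166, 197, 154, 180, 165, 177, 164]

print(round(pearson_r(ea, ma), 2))
```

The result should fall close to the .65 reported for Table 8; a small difference is to be expected because the scattergram value rests on grouped scores. The sums of squares and cross products run into the hundreds of thousands even for 20 pupils, which is why the text calls the hand computation tedious.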

This diversity of computing aids is probably indicative of basic diffi- 
culties inherent in the process of computing r “by hand.” There is not any 
really simple way to do it. The chief difficulty is that r indicates a relation- 
ship between paired scores, so in order to obtain r without undue labor one 
must by some indirect method find the sum of the products of the paired 
scores. That is the intent of every simplified procedure. 

One complicating factor in the attempt to simplify the computation of r 
is that in order to make the numbers involved as small as possible, most 
chart designers set up procedures that result in many negative numbers. 
Negative numbers are likely to confuse the average student, while large 
positive ones are tedious to handle. Both invite sizable errors. In the 
writers’ opinion, it is better for persons who are getting their introduction 
to r in this book to work with somewhat larger positive numbers rather 
than with smaller negative ones, since, thereby, the explanation is simplified 
considerably and the chances of misunderstanding the procedure reduced. 
If you already know how to compute r from a scattergram by some other
method, and know it well, use it rather than the one described below.¹³

Constructing the scattergram. In order to get some “live” data, one 
of the authors took 43 sets of classroom test scores from his roll book. 


13 A thorough explanation of one procedure is given by Helen M. Walker, Elementary
Statistical Methods, pages 229-235. New York: Henry Holt and Company, 1943.




These are shown in Table 17. Note there that the X scores represent num- 
ber of points (not percentage right) on a pretest at the beginning of the 
quarter, while the Y scores were secured at midterm. 


TABLE 17


PRETEST AND MIDTERM SCORES OF 43 GRADUATE STUDENTS ON TWO TEACHER-MADE
OBJECTIVE TESTS IN INTERMEDIATE STATISTICS


                      Midterm                             Midterm
           Pretest    Examination              Pretest    Examination
Student    Score      Score          Student   Score      Score
           (X)        (Y)                      (X)        (Y)

a          62         51             v         62         65
b          55         66             w         53         56
c          55         40             x         62         56
d          49         38             y         49         54
e          46         51             z         62         56
f          67         57             aa        44         39
g          32         42             bb        60         49
h          42         35             cc        47         55
i          67         61             dd        44         45
j          55         46             ee        49         39
k          44         33             ff        36         38
l          46         58             gg        40         15
m          37         48             hh        53         50
n          57         44             ii        44         47
o          34         55             jj        49         4?
p          57         44             kk        49         35
q          58         59             ll        44         43
r          58         45             mm        53         60
s          49         51             nn        53         5?
t          40         32             oo        42         35
u          73         54             pp        57         45
                                     qq        49         41


The highest score in the X distribution is 73 and the lowest 32, so the
range is (73 − 32) + 1, or 42. Since 42/12 = 3.5, and 42/15 = 2.8, it would be
appropriate to use a grouping interval of either 4 or 3. Trying 4 and starting
the whole-number lower limit of the lowest class with 32, which is a mul-
tiple of 4, we find that the classes run from 32-35 to 72-75 and that there
are only 11 classes. Since our arbitrary rule is to have not fewer than 12
nor more than approximately 15 classes, we need to use the smaller in-
terval, 3. With it there will be 15 classes running from 30-32 (30 is a mul-
tiple of 3) to 72-74.
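
The trial of intervals described above can be sketched in a few lines. The helper name `class_limits` is hypothetical; the only rule it encodes is the one stated in the text, that the lowest class starts at a multiple of the interval.

```python
def class_limits(low, high, interval):
    """Return the (lower, upper) whole-number limits of the classes that
    cover low..high, starting the lowest class at a multiple of the interval."""
    lower = (low // interval) * interval
    classes = []
    while lower <= high:
        classes.append((lower, lower + interval - 1))
        lower += interval
    return classes


# Pretest (X) scores run from 32 to 73.
for interval in (4, 3):
    classes = class_limits(32, 73, interval)
    print(interval, len(classes), classes[0], classes[-1])

# Midterm (Y) scores run from 15 to 66; an interval of 4 is tried.
y_classes = class_limits(15, 66, 4)
print(4, len(y_classes), y_classes[0], y_classes[-1])
```

An interval is acceptable under the chapter's arbitrary rule when the number of classes falls between 12 and about 15.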

Repeating this kind of procedure for the Y distribution, where the scores
range from 66 to 15, results in a grouping interval of 4. The 14 classes run
from 12-15 (12 is a multiple of 4) to 64-67.

Now construct a 15 × 14 scattergram, as shown in Table 18.¹⁴ Put the


14 The method set forth in Table 18 is a slight variation of one described in Quinn
McNemar, Psychological Statistics, pages 94-90. New York: John Wiley & Sons, 1949.


TABLE 18


COMPUTATION OF THE r BETWEEN SCORES ON TWO TEACHER-MADE OBJECTIVE TESTS
(Scores from Table 17)

X = Score on Pretest
Y = Score on Midterm Examination

f_x         1   1   2   2   7   3   7   4   3   5   5   0   2   0   1 |  N = 43
d_x         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14 |
f_x d_x     0   1   4   6  28  15  42  28  24  45  50   0  24   0  14 |  Σ = 281
f_x d_x²    0   1   8  18 112  75 252 196 192 405 500   0 288   0 196 |  Σ = 2,243

Y class                                                               | f_y d_y f_yd_y f_yd_y²   (a)    (b)
64-67       .   .   .   .   .   .   .   .   1   .   1   .   .   .   . |   2  13     26     338    18    234
60-63       .   .   .   .   .   .   .   1   .   .   .   .   1   .   . |   2  12     24     288    19    228
56-59       .   .   .   .   .   1   .   1   .   1   2   .   1   .   . |   6  11     66     726    53    583
52-55       .   1   .   .   .   1   1   1   .   .   .   .   .   .   1 |   5  10     50     500    33    330
48-51       .   .   1   .   .   1   1   1   .   .   2   .   .   .   . |   6   9     54     486    40    360
44-47       .   .   .   .   2   .   .   .   1   4   .   .   .   .   . |   7   8     56     448    52    416
40-43       1   .   .   .   1   .   2   .   1   .   .   .   .   .   . |   5   7     35     245    24    168
36-39       .   .   1   .   1   .   2   .   .   .   .   .   .   .   . |   4   6     24     144    18    108
32-35       .   .   .   1   3   .   1   .   .   .   .   .   .   .   . |   5   5     25     125    21    105
28-31       .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   0   4      0       0     0      0
24-27       .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   0   3      0       0     0      0
20-23       .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   0   2      0       0     0      0
16-19       .   .   .   .   .   .   .   .   .   .   .   .   .   .   . |   0   1      0       0     0      0
12-15       .   .   .   1   .   .   .   .   .   .   .   .   .   .   . |   1   0      0       0     3      0

X class    30  33  36  39  42  45  48  51  54  57  60  63  66  69  72 | Sums
   to      32  35  38  41  44  47  50  53  56  59  62  65  68  71  74 |  43        360   3,300   281  2,532

(a) = f_x × d_x row sums;  (b) = d_y × (a), whose sum is Σd_x d_y.

$$r = \frac{N\sum d_x d_y - \left(\sum f_x d_x\right)\left(\sum f_y d_y\right)}{\sqrt{N\sum f_x d_x^2 - \left(\sum f_x d_x\right)^2}\,\sqrt{N\sum f_y d_y^2 - \left(\sum f_y d_y\right)^2}}$$

$$r = \frac{43(2{,}532) - (281)(360)}{\sqrt{43(2{,}243) - (281)^2}\,\sqrt{43(3{,}300) - (360)^2}} = \frac{7{,}716}{(132)(111)} = \frac{7{,}716}{14{,}652} = .53$$




pretest scores across the bottom, from low to high, and the midterm scores 
in ascending order on the left side of the figure. Then, using the scores in 
Table 17, put a tally (/) in the proper cell for each pair of scores. Find the
person's X score along the bottom of the scattergram and then go up in 
that column until you are directly opposite his Y score. Put a tally in that 
cell. Do this for each individual, and you will have a total of 43 tallies, 
for there are 43 pairs of scores. 

For example, the lonesome-looking entry in the cell designating an X
score of 39-41 and a Y score of 12-15 indicates the student who earned
40 points on the pretest but only 15 at midterm. The highest score on the 
pretest was 73, earned by the person who made 54 on the midterm exam; 
this is shown by the tally at the far right. 

After the tallying is over, change the tallies to numbers. 
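
The tallying step can be sketched as a count of pairs per cell. The class starting points (30 for X, 12 for Y) and intervals (3 and 4) are those chosen in the text; the function names and the use of a `Counter` are illustrative choices. Only a few pairs from Table 17 are shown, including the two singled out above.

```python
from collections import Counter


def class_index(score, lowest_lower_limit, interval):
    """Number of classes a score lies above the lowest class (0 for the lowest)."""
    return (score - lowest_lower_limit) // interval


def tally(pairs, x_start=30, x_interval=3, y_start=12, y_interval=4):
    """Count the (X, Y) pairs falling in each scattergram cell.
    Cells are keyed by (X class index, Y class index)."""
    cells = Counter()
    for x, y in pairs:
        key = (class_index(x, x_start, x_interval),
               class_index(y, y_start, y_interval))
        cells[key] += 1
    return cells


# A few (pretest, midterm) pairs from Table 17: students a, b, c, d, e, gg, u.
pairs = [(62, 51), (55, 66), (55, 40), (49, 38), (46, 51), (40, 15), (73, 54)]
cells = tally(pairs)
for (x_idx, y_idx), count in sorted(cells.items()):
    print(f"X class {x_idx}, Y class {y_idx}: {count}")
```

The class indexes produced here are the same numbers that serve as the d values in the computation of r, since the arbitrary origin is placed at the lowest class.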

Computing r from the scattergram. Now you are ready to find r.
Set up the four new rows at the top of the scattergram and the six col-
umns at the right. The first row is for f_x, the frequency of X scores in
each column of the scattergram. These are 1, 1, 1 + 1 = 2, 1 + 1 = 2,
3 + 1 + 1 + 2 = 7, and so forth. The first column to the right of the
scattergram is for f_y, the frequency of Y scores in each row of the scatter-
gram. These are, from top to bottom, 1 + 1 = 2, 1 + 1 = 2, 1 + 1 +
1 + 2 + 1 = 6, and so on.

The second row at the top, set in bold-face type and labeled d_x, shows
deviations from the arbitrary origin, which in this case is (30 + 32)/2 = 31,
the midpoint of the lowest X class. These d_x values start at 0 and go by
1's up to 14. The d_y column at the right begins at the bottom with 0 and
goes by 1's up to 13. Obviously, the numbers continue by 1's until the
highest class is reached. How high they go depends upon the number of
classes for the X and Y variables in the scattergram. The highest number
for d_x will always be one less than the number of X classes, and the top
number for d_y will always be one less than the number of Y classes.

The rest of the procedure is self-explanatory, if the meaning of the sym-
bols is understood. The third (56-59) row of the scattergram will serve
to show how the “f_x × d_x row sums” are obtained. There are frequencies
in five different cells of that row. These are f_x's for the row. To get the
f_x × d_x row sum for this third row we take the 1 farthest left and multiply
it by its d_x of 5. The next 1 is multiplied by its d_x of 7, the next 1 by 9,
the 2 by 10, and the 1 farthest right by 12. (1 × 5) + (1 × 7) + (1 × 9)
+ (2 × 10) + (1 × 12) = 53, the figure shown in the “f_x × d_x row sums”
column.

There are two checks in the table. The sum of the f_x row at the top should
equal the sum of the f_y column at the right, since both equal N, the number
of pairs of scores. Also, the third row at the top sums to Σf_x d_x, which is
the same as the sum of the “f_x × d_x row sums” column at the right.




Because there are no other automatic checks, all computations should be
gone over carefully, preferably by someone besides the original computer. 
Substituting in the formula is straightforward, if one simply follows the 
symbols. Each of the square roots in the denominator should be carried 
out to four figures and rounded back to three for computational ease (see 
Appendix D, pages 456-458). You may find a table of square roots helpful 
at this point. The final r should not usually be reported to more than 
two figures, unless it is based upon several hundred individuals. 
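
The whole computation can be re-run mechanically as a check on the hand work. In the sketch below the cell frequencies are keyed by (d_x, d_y) as read from the Table 18 scattergram; the dictionary layout and the function names are assumptions made for illustration, and the formula is the coded-deviation form used in Table 18.

```python
import math

# Cell frequencies of the Table 18 scattergram, keyed by (d_x, d_y).
CELLS = {
    (8, 13): 1, (10, 13): 1,
    (7, 12): 1, (12, 12): 1,
    (5, 11): 1, (7, 11): 1, (9, 11): 1, (10, 11): 2, (12, 11): 1,
    (1, 10): 1, (5, 10): 1, (6, 10): 1, (7, 10): 1, (14, 10): 1,
    (2, 9): 1, (5, 9): 1, (6, 9): 1, (7, 9): 1, (10, 9): 2,
    (4, 8): 2, (8, 8): 1, (9, 8): 4,
    (0, 7): 1, (4, 7): 1, (6, 7): 2, (8, 7): 1,
    (2, 6): 1, (4, 6): 1, (6, 6): 2,
    (3, 5): 1, (4, 5): 3, (6, 5): 1,
    (3, 0): 1,
}


def coded_sums(cells):
    """Return the six sums needed for r from coded deviations."""
    return {
        "n": sum(cells.values()),
        "fdx": sum(f * dx for (dx, dy), f in cells.items()),
        "fdy": sum(f * dy for (dx, dy), f in cells.items()),
        "fdx2": sum(f * dx * dx for (dx, dy), f in cells.items()),
        "fdy2": sum(f * dy * dy for (dx, dy), f in cells.items()),
        "dxdy": sum(f * dx * dy for (dx, dy), f in cells.items()),
    }


def r_from_scattergram(cells):
    """Product-moment r computed from grouped data in coded-deviation form."""
    s = coded_sums(cells)
    numerator = s["n"] * s["dxdy"] - s["fdx"] * s["fdy"]
    root_x = math.sqrt(s["n"] * s["fdx2"] - s["fdx"] ** 2)
    root_y = math.sqrt(s["n"] * s["fdy2"] - s["fdy"] ** 2)
    return numerator / (root_x * root_y)


sums = coded_sums(CELLS)
print(sums)
print(round(r_from_scattergram(CELLS), 2))
```

The two checks described in the text correspond to `sums["n"]` equalling the number of pairs and `sums["fdx"]` agreeing with the total of the row-sum column.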
Computing the two means from scattergram data. On page 79 a
formula for the mean was given as $M = M' + \frac{i \sum fd}{N}$. For the X scores
in Table 18, the assumed mean, M' (or, more properly in this example, the
arbitrary origin), falls at the middle of the 30-32 class, which is (30 + 32)
divided by 2, or 31. The formula for the mean of the X distribution is

$$M_x = M'_x + \frac{i_x \sum f_x d_x}{N} = \frac{30 + 32}{2} + \frac{3 \times 281}{43} = 31 + \frac{843}{43} = 31 + 19.6 = 50.6$$

Similarly, the mean of the Y distribution is

$$M_y = M'_y + \frac{i_y \sum f_y d_y}{N} = \frac{12 + 15}{2} + \frac{4 \times 360}{43} = 13.5 + \frac{1440}{43} = 13.5 + 33.5 = 47.0$$


Of course, this does not mean that the students knew less at the midterm 
than when they began the course. Had the Y test been given at the first 
of the quarter, the average score would probably have been only slightly 
above zero, since the midterm examination covered much more advanced 
material than the pretest. 

Computing the two standard deviations from scattergram data. 
The formula for obtaining the standard deviation from grouped data that 
was explained on page 83 quickly yields standard deviations for the X 
and Y distributions, since the two square root values have already been 
obtained as the denominator, (132)(111), in the Table 18 computation of r: 


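$$s_x = \frac{i_x \sqrt{N\sum f_x d_x^2 - \left(\sum f_x d_x\right)^2}}{N} = \frac{3 \times 132}{43} = \frac{396}{43} = 9.2$$

$$s_y = \frac{i_y \sqrt{N\sum f_y d_y^2 - \left(\sum f_y d_y\right)^2}}{N} = \frac{4 \times 111}{43} = \frac{444}{43} = 10.3$$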

Determining the medians. It may be helpful at this point to review
the computation of the median by obtaining this statistic for the X and
Y distributions by the method explained on pages 76-77. The median is the
50th percentile. Fifty per cent (or half) of 43 is 21.5. Look at the first row
at the top of the scattergram and count f_x frequencies from left to right:
1 + 1 + 2 + 2 + 7 + 3 = 16. This leaves you “suspended” between the
X classes 45-47 and 48-50, or at 47.5. Now count 21.5 − 16 = 5.5 units
into the 48-50 class, which has a frequency of 7 and an interval of 3
(because it includes scores of 48, 49, and 50). (5.5/7) × 3 = 16.5/7 = 2.4, and
47.5 + 2.4 = 49.9, the median of the X (pretest) distribution. Check by
counting the other way: 50.5 − (1.5/7 × 3) = 49.9. Compare this median
with the mean, which is 50.6.

The median of the Y (midterm) examination is obtained similarly by
way of the f_y frequencies in the first column at the right of the scattergram:
43.5 + (6.5/7 × 4) = 47.2. Check by counting down: 47.5 − (.5/7 × 4) = 47.2.
This is about the same as the Y mean of 47.0.
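
The counting procedure for the median can be sketched as follows. The function name `grouped_median` is hypothetical; the frequencies are the f_x and f_y values of the Table 18 scattergram, listed from the lowest class upward, and interpolation within the median class assumes scores spread evenly through it.

```python
def grouped_median(class_lower_limits, freqs, interval):
    """Median from a grouped frequency distribution, interpolating within
    the class that contains the N/2-th case."""
    half = sum(freqs) / 2
    cumulative = 0
    for lower, f in zip(class_lower_limits, freqs):
        if f > 0 and cumulative + f >= half:
            real_lower_limit = lower - 0.5
            return real_lower_limit + (half - cumulative) / f * interval
        cumulative += f
    return None  # reached only when there are no scores


# Pretest (X): classes 30-32 through 72-74.
x_freqs = [1, 1, 2, 2, 7, 3, 7, 4, 3, 5, 5, 0, 2, 0, 1]
x_median = grouped_median(range(30, 75, 3), x_freqs, 3)

# Midterm (Y): classes 12-15 through 64-67.
y_freqs = [1, 0, 0, 0, 0, 5, 4, 5, 7, 6, 5, 6, 2, 2]
y_median = grouped_median(range(12, 68, 4), y_freqs, 4)

print(round(x_median, 1), round(y_median, 1))
```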

Rank correlation. If the measures to be correlated are consecutive,
untied ranks 1, 2, 3, and so on, through N, the r formula reduces to
$1 - \frac{6\sum D^2}{N(N^2 - 1)}$, where D represents the positive difference between paired
ranks and N is the number of pairs. An illustration will show how simple
the computational procedure is.

A certain test question requires that six historical events be ranked in
chronological order, 1 representing the earliest and 6 the most recent,
without ties. Table 19 shows the events, the ranks assigned by Richard
Roe and by John Doe, and the correct ranks. The first r is secured by
summing the squared differences between Richard's ranks and the correct


TABLE 19


THE COMPUTATION OF RHO FOR EACH OF TWO STUDENTS WHO ARRANGED
SIX HISTORICAL EVENTS IN CHRONOLOGICAL ORDER


                              Richard                          John
Events to Be Ranked           Roe's    Correct                 Doe's    Correct
(1 = earliest)                Ranks    Ranks      D     D²     Ranks    Ranks      D     D²
French Revolution               2        6        4     16       5        6        1      1
Magna Charta                    4        3        1      1       4        3        1      1
Pompeii Destroyed by
  Vesuvius                      3        1        2      4       2        1        1      1
Columbus Discovers
  America                       1        4        3      9       3        4        1      1
Fall of Roman Empire            5        2        3      9       1        2        1      1
Spanish Armada
  Destroyed                     6        5        1      1       6        5        1      1
N = 6                                           ΣD² = 40                         ΣD² = 6

Richard Roe:

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 40}{6[(6)^2 - 1]} = 1 - \frac{40}{35} = 1 - 1.14 = -.14$$

John Doe:

$$\rho = 1 - \frac{6 \times 6}{6 \times 35} = 1 - \frac{6}{35} = 1 - .17 = .83$$




ranks and substituting in the formula. This r, usually called rho and written
as ρ, equals −.14. Like r, rho can vary from −1 through 0 to +1.
The negative rho found for Richard probably indicates poor guessing
rather than actual misinformation. Note that John Doe, with a rho of .83,
seems to have had a fairly good general knowledge of the correct chronology,
even though he “missed” every one of the six ranks.
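
The two rhos can be verified with a short sketch of the rank-difference formula for untied ranks. The function name `spearman_rho` is illustrative; the ranks are those shown in Table 19, listed in the order French Revolution, Magna Charta, Pompeii, Columbus, Fall of Roman Empire, Spanish Armada.

```python
def spearman_rho(ranks_a, ranks_b):
    """Rank-difference correlation for two sets of untied ranks 1..N."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


correct = [6, 3, 1, 4, 2, 5]
richard = [2, 4, 3, 1, 5, 6]
john = [5, 4, 2, 3, 1, 6]

print(round(spearman_rho(richard, correct), 2))
print(round(spearman_rho(john, correct), 2))
```

John's case shows why rho is a gentler scoring rule than counting exact hits: every one of his ranks is off by a single place, yet the overall order is largely preserved.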


TABLE 20


THE VARIOUS VALUES OF RHO (ρ) FOR ALL POSSIBLE SUMS OF SQUARED
DEVIATIONS (ΣD²) FOR N's FROM 2 THROUGH 10


$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$

Sum of                                                          Sum of
Squared         Number of Things Ranked (N)                     Squared
Deviations,                                                     Deviations,
ΣD²          2    3    4    5    6    7    8    9    10         ΣD²




TABLE 20 (Continued)


THE VARIOUS VALUES OF RHO (ρ) FOR ALL POSSIBLE SUMS OF SQUARED
DEVIATIONS (ΣD²) FOR N's FROM 2 THROUGH 10


$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$

ΣD²       7       8       9      10   |  ΣD²       7       8       9      10
 72    −.29     .14     .40     .56   |  122            −.45    −.02     .26
 74    −.32     .12     .38     .55   |  124            −.48    −.03     .25
 76    −.36     .10     .37     .54   |  126            −.50    −.05     .24
 78    −.39     .07     .35     .53   |  128            −.52    −.07     .22
 80    −.43     .05     .33     .52   |  130            −.55    −.08     .21
 82    −.46     .02     .32     .50   |  132            −.57    −.10     .20
 84    −.50     .00     .30     .49   |  134            −.60    −.12     .19
 86    −.54    −.02     .28     .48   |  136            −.62    −.13     .18
 88    −.57    −.05     .27     .47   |  138            −.64    −.15     .16
 90    −.61    −.07     .25     .45   |  140            −.67    −.17     .15
 92    −.64    −.10     .23     .44   |  142            −.69    −.18     .14
 94    −.68    −.12     .22     .43   |  144            −.71    −.20     .13
 96    −.71    −.14     .20     .42   |  146            −.74    −.22     .12
 98    −.75    −.17     .18     .41   |  148            −.76    −.23     .10
100    −.79    −.19     .17     .39   |  150            −.79    −.25     .09
102    −.82    −.21     .15     .38   |  152            −.81    −.27     .08
104    −.86    −.24     .13     .37   |  154            −.83    −.28     .07
106    −.89    −.26     .12     .36   |  156            −.86    −.30     .05
108    −.93    −.29     .10     .35   |  158            −.88    −.32     .04
110    −.96    −.31     .08     .33   |  160            −.90    −.33     .03
112   −1.00    −.33     .07     .32   |  162            −.93    −.35     .02
114            −.36     .05     .31   |  164            −.95    −.37     .01
116            −.38     .03     .30   |  166            −.98    −.38    −.01
118            −.40     .02     .28   |  168           −1.00    −.40    −.02
120            −.43     .00     .27   |

Table 20 makes the computation of rho itself unnecessary when the
number of pairs is less than 11. Just compute ΣD², look in Table 20 for
this figure, move over to the appropriate N column, and read the value of
rho there. For example, the ΣD² for Richard Roe was 40. Looking in the
ΣD² column at 40 and then to the right under the N of 6 we find that rho
is −.14, which agrees with the value computed in Table 19.

A simplified scoring procedure for “sequence” items is described in
Appendix C on page 455.
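
A table such as Table 20 can be regenerated directly from the rho formula. The sketch below is only an illustration of that idea: the function names are hypothetical, sums are stepped by 2 as in Table 20, and the largest sum for a given N is taken to be N(N² − 1)/3, the value at which rho reaches −1.00.

```python
def rho_from_sum(sum_d2, n):
    """Rho for a given sum of squared rank differences and number of pairs."""
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


def rho_table(n):
    """Map each even sum of squared differences, from 0 up to the largest
    possible sum for n ranks, to rho rounded to two decimals."""
    max_sum = n * (n * n - 1) // 3
    return {s: round(rho_from_sum(s, n), 2) for s in range(0, max_sum + 2, 2)
            if s <= max_sum}


table_6 = rho_table(6)
print(table_6[40])        # Richard Roe's sum of squared differences, N = 6
print(rho_table(7)[72])   # an entry from the continued part of Table 20
```

A machine rounding rule may differ from the printed table in the rare case of an exact half in the third decimal place, so the printed table remains the authority for reported values.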


Strictly speaking, the rho formula is not appropriate when any ties occur.
It is sometimes used as a short-cut procedure for estimating the r between
scores by first changing the two sets of scores to ranks. If only an approxi-
mate measure of relationship is desired, this method may yield satisfactory
results, despite ties. It will usually be tedious when N is as great as 30,
however, for the ranking process will be time-consuming and the fractional
ranks will be hard to square. A worked example appears in Table 21. It is
based upon 20 pairs of scores in Table 7, page 67, for which the r from
Table 8 is .65. Note that rho is .67, a discrepancy of only .02.
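
The short-cut with tied scores can be sketched as below. Tied scores are given the average of the ranks they would occupy, which is how the fractional ranks of Table 21 arise; the function name `average_ranks` is an illustrative choice, and the scores are the 20 EA and MA pairs listed in Table 21.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share the
    mean of the ranks they occupy."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for s in set(scores):
        first_rank = ordered.index(s) + 1
        count = ordered.count(s)
        rank_of[s] = first_rank + (count - 1) / 2
    return [rank_of[s] for s in scores]


ea = [188, 186, 185, 183, 183, 182, 181, 180, 179, 176,
      176, 176, 176, 175, 174, 173, 171, 167, 167, 165]
ma = [208, 218, 201, 185, 165, 191, 185, 193, 181, 165,
      187, 176, 180, 166, 197, 154, 180, 165, 177, 164]

ea_ranks = average_ranks(ea)
ma_ranks = average_ranks(ma)
sum_d2 = sum((a - b) ** 2 for a, b in zip(ea_ranks, ma_ranks))
n = len(ea)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(rho, 2))
```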




TABLE 20 (Continued)


THE VARIOUS VALUES OF RHO (ρ) FOR ALL POSSIBLE SUMS OF SQUARED
DEVIATIONS (ΣD²) FOR N's FROM 2 THROUGH 10


$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$

ΣD²       9      10   |  ΣD²       9      10   |  ΣD²      10
170    −.42    −.03   |  224    −.87    −.36   |  278    −.68
172    −.43    −.04   |  226    −.88    −.37   |  280    −.70
174    −.45    −.05   |  228    −.90    −.38   |  282    −.71
176    −.47    −.07   |  230    −.92    −.39   |  284    −.72
178    −.48    −.08   |  232    −.93    −.41   |  286    −.73
180    −.50    −.09   |  234    −.95    −.42   |  288    −.75
182    −.52    −.10   |  236    −.97    −.43   |  290    −.76
184    −.53    −.12   |  238    −.98    −.44   |  292    −.77
186    −.55    −.13   |  240   −1.00    −.45   |  294    −.78
188    −.57    −.14   |  242            −.47   |  296    −.79
190    −.58    −.15   |  244            −.48   |  298    −.81
192    −.60    −.16   |  246            −.49   |  300    −.82
194    −.62    −.18   |  248            −.50   |  302    −.83
196    −.63    −.19   |  250            −.52   |  304    −.84
198    −.65    −.20   |  252            −.53   |  306    −.85
200    −.67    −.21   |  254            −.54   |  308    −.87
202    −.68    −.22   |  256            −.55   |  310    −.88
204    −.70    −.24   |  258            −.56   |  312    −.89
206    −.72    −.25   |  260            −.58   |  314    −.90
208    −.73    −.26   |  262            −.59   |  316    −.92
210    −.75    −.27   |  264            −.60   |  318    −.93
212    −.77    −.28   |  266            −.61   |  320    −.94
214    −.78    −.30   |  268            −.62   |  322    −.95
216    −.80    −.31   |  270            −.64   |  324    −.96
218    −.82    −.32   |  272            −.65   |  326    −.98
220    −.83    −.33   |  274            −.66   |  328    −.99
222    −.85    −.35   |  276            −.67   |  330   −1.00


Interpreting the coefficient of correlation. In interpreting the coeffi- 
cient of correlation, two things must be considered. The first is the sign 
of the coefficient. The sign indicates the direction of the relationship. 
Positive coefficients indicate direct relationship; that is, there is a tendency 
for the two series of values to vary in the same direction, high values in one 
column being associated with high values in the other column, low values 
in one column being associated with low values in the other column, and 
so on. On the other hand, negative coefficients indicate inverse relation- 
ship; that is, there is a tendency for the two series of values to vary in 
opposite directions, high values in one column being associated with low 
values in the other column, and high values in that column being associated 
with low values in the first column.

Another thing is equally important and far more difficult to interpret; 
that is, the magnitude or size of the coefficient. The size of the coefficient 
indicates the degree or closeness of the relationship, just as the sign of the 




TABLE 21


ESTIMATING THE COEFFICIENT OF CORRELATION BY THE
SPEARMAN RANK-DIFFERENCE METHOD


        Scores                    Ranks            Differences in Ranks
Educational    Mental
Age (EA)       Age (MA)       EA        MA            D          D²
   188           208           1         2            1          1
   186           218           2         1            1          1
   185           201           3         3            0          0
   183           185           4.5       8.5          4         16
   183           165           4.5      17           12.5      156.25
   182           191           6         6            0          0
   181           185           7         8.5          1.5        2.25
   180           193           8         5            3          9
   179           181           9        10            1          1
   176           165          11.5      17            5.5       30.25
   176           187          11.5       7            4.5       20.25
   176           176          11.5      14            2.5        6.25
   176           180          11.5      11.5          0          0
   175           166          14        15            1          1
   174           197          15         4           11        121
   173           154          16        20            4         16
   171           180          17        11.5          5.5       30.25
   167           165          18.5      17            1.5        2.25
   167           177          18.5      13            5.5       30.25
   165           164          20        19            1          1
N = 20                                                      ΣD² = 445


$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6(445)}{20[(20)^2 - 1]} = 1 - \frac{2{,}670}{7{,}980} = 1 - .33 = .67$$


coefficient indicates the direction of the relationship. The minimum co-
efficient is .00, which indicates no consistent relationship whatsoever. From
this minimum value the coefficients increase in both directions until −1.00
is reached for one limit, and 1.00 for the other. It should be noted that
both −1.00 and 1.00 indicate equally close relationship, for both are per-
fect. Their one important difference is in direction, the former being in-
verse and the latter being direct. In like manner, all other values of the
same size, such as −.50 and .50, indicate equally close relationship. It is
the size, not the sign of the coefficient that gives the clue to the closeness
or degree of relationship.

The problem, then, is to know how close a relationship is indicated by a
coefficient of correlation of a given magnitude, regardless of sign. For ex-
ample, how close a relationship is indicated by a coefficient of .60? Unfor-
tunately, there is no simple way of answering such a question. Attempts
to indicate this relationship by some descriptive adjective, such as “high”
or “marked,” are vague and often misleading, to say the least. As a matter




of fact, a coefficient of .60 might be regarded as high for one type of situa- 
tion and low for another. For example, a coefficient of .60 between a general 
intelligence test administered at the beginning of the year and school 
marks recorded at the end of the year might be regarded as high, because 
such coefficients usually fall below that. But a coefficient of .60 between 
scores on two forms of this intelligence test administered the same day 
would be unusually low. In other words, “high” and “low” have only 
relative meaning. Before an interpretation can be made of a coefficient on 
this basis, the reader must at least know what the central tendency of such 
coefficients for similar data is. 

Expectancy tables. A most helpful way to interpret the degree of rela- 
tionship between two variables is to construct an expectancy table from the 
scatter diagram and inspect it carefully. This may be done with the 
Table 18 data on page 92 by calling scores above 50 on the pretest and 
scores higher than 47 on the midterm exam “above average," as shown in 
Table 22. These cutting points give as near a 50 per cent split of the scores 


TABLE 22


A SIMPLE EXPECTANCY TABLE, BASED UPON THE 43 PAIRS OF SCORES IN THE
TABLE 18 SCATTERGRAM, FOR WHICH r = .53


                              Below Average     Above Average
                              on Pretest        on Pretest        Sums
                              (30-50)           (51-74)
Above Average at Midterm       7                14                 21
(48-67)                        7/23 = 30%       14/20 = 70%
Below Average at Midterm      16                 6                 22
(12-47)                       16/23 = 70%       6/20 = 30%
Sums                          23                20                 43


as possible. Twenty-three persons are “below average" on the pretest and 
20 “above average." Twenty-two students are “below average" on the 
midterm examination and 21 “above average." 

Of those students who were below average on the pretest, 70 per cent 
were also below average on the midterm examination, so the odds are 7:3 
that a person scoring below average initially will six weeks later again score 
below average. The same 7:3 odds hold for those who score above average 
on the pretest; this will not necessarily be precisely true in other similar 
problems because the number of persons “above average" may differ con- 
siderably for the two tests. 

The 7 individuals who went from below average on the pretest to above 
average at midterm might be called “false negatives,” since at first they 
were classified too low, while the 6 who went from above average to below 




average might be labeled “false positives.” Only 3 of the 7 “false negatives”
changed greatly; the other 4 were not very low on the pretest or very high
at midterm. Likewise, none of the 6 “false positives” were very high on the
pretest or very low at midterm. Obviously, though the four-cell expectancy
table provides a useful summary, it is less exact than the scattergram.¹⁵
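
The four-cell expectancy table can be derived mechanically from the scattergram. In this sketch the cell frequencies are keyed by (d_x, d_y) as read from Table 18; the cutting points are expressed as class indexes (the 51-53 pretest class has d_x = 7 and the 48-51 midterm class has d_y = 9). The dictionary layout and the function name are assumptions made for illustration.

```python
# Cell frequencies of the Table 18 scattergram, keyed by (d_x, d_y).
CELLS = {
    (8, 13): 1, (10, 13): 1,
    (7, 12): 1, (12, 12): 1,
    (5, 11): 1, (7, 11): 1, (9, 11): 1, (10, 11): 2, (12, 11): 1,
    (1, 10): 1, (5, 10): 1, (6, 10): 1, (7, 10): 1, (14, 10): 1,
    (2, 9): 1, (5, 9): 1, (6, 9): 1, (7, 9): 1, (10, 9): 2,
    (4, 8): 2, (8, 8): 1, (9, 8): 4,
    (0, 7): 1, (4, 7): 1, (6, 7): 2, (8, 7): 1,
    (2, 6): 1, (4, 6): 1, (6, 6): 2,
    (3, 5): 1, (4, 5): 3, (6, 5): 1,
    (3, 0): 1,
}


def expectancy_table(cells, x_cut, y_cut):
    """Collapse a scattergram into four cells keyed by
    (pretest status, midterm status), each 'below' or 'above'."""
    table = {(p, m): 0 for p in ("below", "above") for m in ("below", "above")}
    for (dx, dy), f in cells.items():
        pretest = "above" if dx >= x_cut else "below"
        midterm = "above" if dy >= y_cut else "below"
        table[(pretest, midterm)] += f
    return table


table = expectancy_table(CELLS, x_cut=7, y_cut=9)
for pretest in ("below", "above"):
    column_total = table[(pretest, "below")] + table[(pretest, "above")]
    for midterm in ("above", "below"):
        count = table[(pretest, midterm)]
        print(f"{pretest} on pretest, {midterm} at midterm: "
              f"{count} of {column_total} ({100 * count / column_total:.0f}%)")
```

Moving either cutting point changes all four cells at once, which is one reason the text prefers cuts that split the group as nearly in half as possible.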

Validity and reliability coefficients. One of the most important uses
of the coefficient of correlation is in determining the validity of a test. There
are two types of validity, or rather two methods of judging the validity of a
test: namely, curricular and statistical. The former is subjective, and the
latter is objective. Curricular validity is determined by examining the con-
tent of the test itself and judging the degree to which it is a true measure of
the important objectives of the course, or a truly representative sampling of
the essential materials of instruction. Statistical validity is determined by
setting up a criterion of the thing which it is desired to measure, and then
computing the coefficient of correlation between the test scores and the
criterion. The r so obtained is called a validity coefficient, and is interpreted
like any other coefficient of correlation.

A second use of the coefficient of correlation is in determining the reliability
of a test. Since reliability is the degree of consistency with which
the test measures whatever it does measure, several ways to determine 
reliability are to be found in computing the coefficient between two forms 
of the same test, two matched halves of a test, or two administrations of 
the same test. 

The r discussed in this section, called the zero-order coefficient of correlation,
is only one of many different types of correlation coefficients.


G. Measures of Error 
Errors in educational measurement may be grouped conveniently into 
three types, according to source: 


1. Errors of technique: 
a. Arithmetical errors in computation, or the like. 
b. Use of inappropriate measures. 
2. Errors of measurement: 
a. Imperfect measuring instruments.
b. Unskilled tester. 
c. Fluctuations in the persons measured. 
3. Errors of sampling: 
a. Selection or bias in sampling.
b. Chance fluctuations in random sampling. 


15 A very clear discussion of various types of expectancy tables, of which the one
above is the simplest, is contained in Alexander G. Wesman, “Expectancy Tables:
A Way of Interpreting Test Validity,” Test Service Bulletin No. 38, 1-5, December,
1949, copies of which may be obtained free from the Psychological Corporation, 522
Fifth Avenue, New York 18, New York.




Errors of technique. Obvious types of errors are mistakes in adding 
scores and various computational errors in statistical analysis. The only 
protection against such errors is the exercise of great care. Likely to be 
more serious are the errors due to the use of inappropriate measures for 
the data in hand. It is poor technique to introduce more refined measures 
than the data warrant or the purpose requires. All statistical formulas are 
based upon certain assumptions which often are not fully met in actual 
practice. The following are common examples: Computations based upon 
data in a grouped frequency distribution depend for complete accuracy 
on the assumption that the scores are uniformly distributed within the 
several intervals or that the midpoint of each interval may be used to rep- 
resent the average value of all scores in the interval. The Pearson r as- 
sumes linear correlation between the two variables—constancy of the rela- 
tionship throughout the range of scores. Many formulas are based on the 
assumption of a normal distribution of the measures. Whenever the data 
in a given situation fail to conform to these assumptions, certain errors are 
introduced. Fortunately, in actual practice, these errors are often not great 
enough to introduce serious errors of interpretation. But gross errors due 
to the use of inappropriate techniques are sufficiently numerous to warrant 
extreme caution. Furfey and Daly¹⁶ made a study of articles containing
product-moment r's and came to the conclusion that this technique is
employed “with little regard to the fulfillment of the necessary antecedent 
conditions." In fact, in 60 of the 63 articles studied they found that “their 
authors have left themselves open to the suspicion of having employed the 
correlation technique in a way which is meaningless, if not positively mis- 
leading." 

Errors of measurement. There are many possible sources of errors in
measurement, even when there are no computational errors and when the 
most appropriate statistical analysis has been employed. In the first place, 
no measuring instrument is perfectly valid or perfectly reliable. In the 
second place, the personal equation of the examiner must be reckoned with. 
Inexperienced examiners may allow too much or too little time in admin- 
istering the test, or may otherwise depart from standardized procedure in 
administering the test or in scoring the papers. In the third place, there is 
likely to be great variability in the responses of the subjects taking the test. 
Accidental occurrences, such as the breaking of a pencil point on timed
tests, fluctuations in motivation, fatigue, and other physical and mental 
factors may seriously affect the test results. 

It will be noted that some errors of measurement are systematic and
tend to affect all individuals alike. Allowing too much or too little time
on a test of reading speed is an example. On the other hand, many errors
of a variable character occur, affecting the individuals unequally or in
different directions. Sensory defects, health conditions, and motivation are
examples of conditions that produce variable errors in measurement. The
effects of these errors are presented briefly in Table 23.

16 Paul H. Furfey and Joseph F. Daly, “Product-moment Correlation as a Research
Technique in Education,” Journal of Educational Psychology, 26: 206-211, March, 1935.


TABLE 23

EFFECTS OF CONSTANT AND VARIABLE ERRORS ON CERTAIN TYPES OF STATISTICS

Measure             Constant Errors           Variable Errors
Central Tendency    Increased or decreased    Usually tend to offset or
                    by amount of the error    balance each other
Variability         Little or no effect       Usually made too large
Relationship        Little or no effect       Usually made too small
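
The pattern summarized in Table 23 can be reproduced with artificial data. The Python sketch below is only an illustration: the scores are simulated, the 5-point constant error and the size of the variable error are arbitrary choices, and pearson_r is a hypothetical helper written for this example.

```python
import random
from statistics import mean, pstdev


def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)          # fixed seed so that the run is repeatable
n = 2000

# Hypothetical "true" scores on a test and on a related second measure.
true_scores = [random.gauss(50, 10) for _ in range(n)]
second_measure = [0.8 * t + random.gauss(0, 6) for t in true_scores]

# Constant error: every pupil's score is raised by the same 5 points.
with_constant = [t + 5 for t in true_scores]
# Variable error: each pupil's score is disturbed by a different chance amount.
with_variable = [t + random.gauss(0, 6) for t in true_scores]

for label, observed in [("no error", true_scores),
                        ("constant error", with_constant),
                        ("variable error", with_variable)]:
    print(f"{label:15s} mean {mean(observed):6.2f}  SD {pstdev(observed):5.2f}  "
          f"r {pearson_r(observed, second_measure):.2f}")
```

Under these assumptions the constant error shifts the mean by exactly 5 points and leaves the standard deviation and r untouched; the variable error leaves the mean nearly unchanged, enlarges the standard deviation, and lowers r.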


It will be observed that constant errors affect measures of central tend-
ency most seriously, and often there are no methods for correcting this bias.

Errors of sampling. It is usually impractical to measure all the cases
of a given type. For example, it would be a formidable task to obtain the
mean IQ of all high-school freshmen in a state, or the difficulty of each
word in a series of textbooks. Fortunately, it is not necessary to do so.
It has been found possible to estimate the range within which the true
measure probably lies. But to do so, one needs a representative sampling
of the total population. Against errors in a selected or “hand-picked” sam-
pling there is no statistical protection. An adequate sampling may be
chosen in a random manner; and the larger the sampling, the better, al-
though increasing the number of cases does not in itself eliminate the pos-
sibility of error. The sampling method determines whether or not a biased
(non-representative) sample will be obtained.
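
A rough illustration of these points follows. It is a sketch under stated assumptions: the population of 10,000 IQ's is simulated, estimate_range is a hypothetical helper, and the range is taken as the sample mean plus or minus 1.96 standard errors of the mean (about 95 chances in 100 on a normal curve), a conventional choice made for this example.

```python
import random
from statistics import mean, stdev

random.seed(1)          # fixed seed so that the run is repeatable

# Hypothetical population of IQ's, invented for illustration only.
population = [random.gauss(100, 15) for _ in range(10_000)]


def estimate_range(sample, z=1.96):
    """Return (low, high): the sample mean plus or minus z standard errors."""
    m = mean(sample)
    standard_error = stdev(sample) / len(sample) ** 0.5
    return m - z * standard_error, m + z * standard_error


random_small = random.sample(population, 25)    # random, few cases
random_large = random.sample(population, 400)   # random, many cases
hand_picked = sorted(population)[-400:]         # the 400 highest scores only

for label, sample in [("random, n = 25", random_small),
                      ("random, n = 400", random_large),
                      ("hand-picked, n = 400", hand_picked)]:
    low, high = estimate_range(sample)
    print(f"{label:22s} {low:6.1f} to {high:6.1f}")

print(f"population mean        {mean(population):6.1f}")
```

The larger random sample gives the narrower range. The hand-picked sample gives a range that is narrow but lies far above the population mean, and nothing in the computation warns of the bias.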

Further reading. In a stimulating article entitled “Errors, Estimates,
and Samples—the Indispensable Concepts,” Charles R. Langmuir17 makes
clear many points that have only been hinted at in this chapter. Three
other sources of valuable supplementary information are “Making Test
Scores Meaningful,”18 “The Three-Legged Coefficient,”19 and “Reliability
and Confidence.”20 They will do much to help the statistically untrained
teacher or administrator understand important aspects of measurement
theory that might otherwise remain vague.

17 In Arthur E. Traxler (ed.), Measurement and Evaluation in the Improvement of
Education, American Council on Education Studies, pp. 68-81, Series I, No. 46, April,
1951. 1785 Massachusetts Ave., N.W., Washington 6, D. C.
18 William B. Schrader, College Board Review, No. 14: 202-208, May, 1951. 425 West
117th St., New York 27, New York.
19 Alexander G. Wesman, Test Service Bulletin No. 40, December, 1950. Psycho-
logical Corporation, 522 Fifth Ave., New York 36, New York. Free.
20 Alexander G. Wesman, Test Service Bulletin No. 44: 1-6, May, 1952. Psychological
Corporation. Free.


H. Summary 


The following is an outline summary of three important concepts and 
some statistics useful in connection with test scores and other quantitative 
data: 

1. Central tendency. 

a. The mean, often called the "average" in everyday life, is obtained 
by summing all the scores and dividing this sum by the number of scores.
It is for most purposes the best measure of central tendency. 

b. The median is a point above which half of the scores lie and below 
which the other half lie. Thus, it is the 50th percentile or Q2.

c. The mode is the most frequent score, a rather crude measure. 

2. Variability. 

a. The standard deviation (SD or σ) involves every measure in the
distribution. Approximately two thirds of all scores in a “normal” fre- 
quency distribution lie not more than one standard deviation away from 
the mean. 

b. D, a percentile measure of dispersion, is the distance between the 
90th and 10th percentiles. Four-tenths of D (0.4D) provides a fairly good
estimate of the standard deviation. 

c. Q, the quartile deviation or semi-interquartile range, is half of the 
distance between Q3 (the 75th percentile) and Q1 (the 25th percentile).
Though widely used, Q is usually a poorer measure of variability than D,
which in turn is somewhat inferior to σ.

d. The range is the distance from the lower real class limit of the 
lowest class to the higher real class limit of the highest class. For most 
purposes it is a very inadequate measure of variability. 

3. Correlation, covariation, or concomitant variation. There are numerous 
ways of expressing co-relationship. The most common statistic is Pear- 
son's r. A simplification of r, applicable chiefly to data originally secured 
in the form of ranks, is rho (ρ). Both r and ρ have values of -1 for perfect
inverse relationship, 0 for sheer chance association, and +1 for perfect 
direct relationship. 

Reliability and validity coefficients are usually r's secured under certain
special conditions. 
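
The statistics in this outline can be tried on a few scores. The Python sketch below uses invented data and the standard statistics module; pearson_r and ranks are hypothetical helpers written for this example. Two cautions: the module's quantiles function interpolates between ungrouped scores and will not always agree with percentiles worked from a grouped frequency distribution, and the range shown assumes whole-number scores with real limits half a point beyond the lowest and highest scores.

```python
from statistics import NormalDist, mean, median, mode, pstdev, quantiles

scores = [52, 55, 58, 60, 60, 61, 63, 64, 66, 68, 70, 73]   # invented test scores

# 1. Central tendency
m, mdn, mo = mean(scores), median(scores), mode(scores)

# 2. Variability
sd = pstdev(scores)                       # standard deviation
deciles = quantiles(scores, n=10)         # nine cut points: P10, P20, ..., P90
d = deciles[8] - deciles[0]               # D = P90 - P10
q1, q2, q3 = quantiles(scores, n=4)       # Q1, Q2 (the median), Q3
q = (q3 - q1) / 2                         # quartile deviation
score_range = (max(scores) + 0.5) - (min(scores) - 0.5)

# Two facts about the normal curve that the outline relies on.
z = NormalDist()
within_one_sd = z.cdf(1) - z.cdf(-1)              # share of cases within 1 SD
d_in_sd_units = z.inv_cdf(0.90) - z.inv_cdf(0.10)  # D expressed in SD units


# 3. Correlation
def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Rank 1 = lowest score; this simple version assumes no tied scores."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


test_a = [10, 14, 15, 19, 25]             # invented scores of five pupils
test_b = [3, 5, 4, 8, 9]
r = pearson_r(test_a, test_b)
rho = pearson_r(ranks(test_a), ranks(test_b))   # rho is r computed on ranks

print(f"mean {m}, median {mdn}, mode {mo}")
print(f"SD {sd:.2f}, D {d:.1f}, 0.4D {0.4 * d:.2f}, Q {q:.2f}, range {score_range}")
print(f"normal curve: {within_one_sd:.3f} within 1 SD; D = {d_in_sd_units:.2f} SD")
print(f"r {r:.2f}, rho {rho:.2f}")
```

On a normal curve D is about 2.56 standard deviations, so 0.4D comes to about 1.03 standard deviations, which is why it serves as an estimate of σ. With only twelve scores, as here, the sample value of 0.4D is a rough estimate at best.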


I. Instructional Test Items 


Appendix A, pages 429-435, contains 50 five-option multiple-choice 
items covering the material in this chapter. After you have gone through
Chapter 3 carefully, turn to them and test your knowledge. Refer to the 
chapter as much as you please. 


SELECTED REFERENCES FOR FURTHER READING


Chambers, E. G., Statistical Calculation for Beginners (Second Edition). Cambridge, 
England: Cambridge University Press, 1952. 168 pages.

Dixon, Wilfred J., and Massey, Frank J., Jr., Introduction to Statistical Analysis.
New York: McGraw-Hill Book Company, 1951. 370 pages.

Edwards, Allen L., Statistical Analysis for Students in Psychology and Education. 
New York: Rinehart & Company, 1946. 360 pages. 

Garrett, Henry E., Statistics in Psychology and Education (Fourth Edition). New 
York: Longmans, Green & Company, 1953. 460 pages. 

Guilford, J. P., Fundamental Statistics in Psychology and Education (Second 
Edition). New York: McGraw-Hill Book Company, 1950. 633 pages. 

Lindquist, E. F., A First Course in Statistics. Boston: Houghton Mifflin Company, 
1942. 242 pages. 

Odell, C. W., An Introduction to Educational Statistics, New York: Prentice-Hall, 
Inc., 1946. 269 pages. 

Tippett, L. H. C., The Methods of Statistics (Fourth Edition). New York: John
Wiley & Sons, 1952. 395 pages. 

Walker, Helen M., Elementary Statistical Methods. New York: Henry Holt & Co., 
1943. 368 pages.

Walker, Helen M., and Lev, Joseph, Statistical Inference. New York: Henry Holt 
and Company, 1953. 510 pages. 

Yule, G. Udny, and Kendall, Maurice G., An Introduction to the Theory of Statistics. 
New York: Hafner Publishing Company, 1950. 701 pages. 


4 


The Characteristics of a Satisfactory 


Measuring Instrument 


A. Introduction 


Importance of the problem. What are the earmarks of a good test, 
examination, or other measuring instrument? In the selection of a test,
as in the selection of an automobile, it is important to know what to look 
for. There is usually a choice among many possibilities which are very un- 
equal in merit. Each year many automobiles are bought because of the 
appeal of some gadget, such as a fancy radiator ornament or cigarette 
lighter; and many standard tests are bought for no better reason. Whether 
a purchaser buys, or is merely sold, depends largely on whether or not he
knows what to look for in the article in question. Moreover, every teacher
will have occasion to use tests of his own construction, and should know 
what qualities to strive for in such tests. As a rule, the same characteristics 
are essential in an informal test made by the classroom teacher as in a 
standard test bought ready-made from a publisher. 

In any satisfactory measuring instrument three qualities are indispen-
sable. These are:


1. Validity
2. Reliability 
3. Usability 


It is essential, therefore, that every teacher have a clear idea regarding the 
meaning of these characteristics, and know how to judge their presence in 
tests, whether standardized or nonstandardized. 



B. Validity


Meaning of validity. One kind of validity concerns the degree to which 
the test or other measuring instrument measures what it claims to. In a 
word, validity means truthfulness.1 Does the test really measure what it
purports to? For example, whether a so-called “arithmetic reasoning test" 
is valid or not depends upon the extent to which it succeeds in measuring 
reasoning ability in arithmetic rather than other things, such as reading 
ability or general intelligence. Validity, then, refers to the truthfulness of 
the test and is always its most important characteristic. No matter what 
other merits the test may possess, if it lacks validity, it is worthless. 
Whether you are selecting a standard test or making an informal test, the 
first thing to consider is its validity. How, then, does one judge whether 
or not a test or other measuring instrument is valid? 

General considerations. The answer to this question may best be 
approached by giving attention to some preliminary considerations of a 
general nature. 


1. The nature of the thing being measured must always determine the 
methods and materials of measurement. In order to judge the validity of 
an intelligence test, for example, it is necessary to consider what intelligence 
is, what its qualities are, or at any rate, how it manifests itself. In like man- 
ner, in order to judge the validity of an achievement test, it is necessary 
to consider what it is that the achievement test is supposed to measure. 
This means that the first step in judging the validity of achievement tests 
is a clear statement of the specific objectives of the course or subject. 

2. Any measurement in education is always a sampling, never entirely 
complete. The test maker relies upon a sample much as does the chemist 
in the health department in passing upon the quality of the city’s water 
supply. In psychological language, any test is merely a series of situations 
designed to call forth a sufficient number of representative responses to 
enable the examiner to determine the amount of the thing in question that 
happens to be present.

3. The accuracy of the measurement, its fineness of discrimination, will 
depend upon the purpose it is to serve. A cheap alarm clock will usually 
suffice for a housewife in determining when to prepare lunch or to expect 
the postman, but a finer timepiece is required for the locomotive engineer. 
In like manner, a sundial or hour glass may be adequate for a gardener, 
but a split-second watch is essential for a football official. It would be al-
most as absurd to attempt to use a sundial to time a football game as to use
it for measuring temperature or wind velocity. In other words, the validity 
of the measuring instrument must always be considered in relation to the 
purpose it is to serve. Validity is always specific, in relation to some definite 
situation. A test is not just valid; it is valid for something. There is no such 
thing as general validity. 


1 See page 111 for the distinction between curricular and statistical validity. 


I. The Validation of Intelligence Tests 


Although the job of constructing the so-called tests of general intelli- 
gence is usually turned over to the specialist, a general knowledge of how 
such tests are validated will enable the teacher to select and use them more 
discriminatingly.

The meaning of intelligence. What, then, is meant by "general intelli- 
gence,” the thing such tests claim to measure? Although there is no una- 
nimity among psychologists regarding the exact definition of intelligence, 
there is substantial agreement that what existing tests attempt to measure 
is capacity to learn, particularly to learn the academic tasks imposed by 
the school. Such a conception of intelligence is not very "general," after all. 
It is clearly narrower than the popular notion, since it is restricted largely 
to abstract intelligence and leaves out of account social intelligence, me- 
chanical intelligence, and intelligence in special fields such as athletics, 
music, or oratory. 

It is also clear at the outset that intelligence can be measured only in- 
directly; its presence must be inferred from the observed behavior of the 
individual, his reactions to certain carefully chosen and controlled situa- 
tions called tests. Such tests should meet two general requirements: First, 
there must be a sufficiently large and varied assortment of test situations 
to call forth a wide variety of mental operations, primarily of the higher 
type, such as imagination, judgment, and reasoning. Second, the situations 
must be of such a nature that every individual taking the test has had 
approximately equal opportunity to learn, and as far as possible, equal mo-
tivation. This second standard is hard to meet and is usually only approxi-
mated even in the best tests. It clearly rules out tests that involve special
talents such as for music or art, and makes questionable those that depend
on specific school experience, which is by no means uniform for all pupils.
In general, group tests meet these standards, especially the second, less
well than do individual tests. The Army Alpha, for example, not only em-
ploys such situations as reading vocabulary and arithmetic, but, being
originally designed for soldiers, has material that is more within the ex-
perience of men and boys than of women and girls.

The Terman criteria. In developing the 1916 Stanford Revision of the
Binet Scale, Terman relied upon three additional criteria of intelligence:
namely, age increase, coherency, and world success.2 Age increase means that
each test item must show an increasing percentage of successful responses
from one year level to the next. This is only a partial criterion, since it must
assume that the items chosen are of a type that may reasonably be expected
to measure intelligence. Purely physical measurements, for example, such
as strength of grip, or speed in running, show age increases. The second
criterion, coherency, is based on the assumption that the whole test is a
more valid measure of intelligence than are any of its parts. Upon the basis
of the entire test, the group is divided into dull, normal, and bright sections.
Then, to be acceptable, each item must discriminate among the sections
by showing a progressively increasing percentage of successes as we go
from dull to bright. This procedure really measures the internal consis-
tency of the test, much as a logician judges the validity of a course of
reasoning. Both Galton and Binet used the method of contrasting groups,
although their groups were selected from external criteria rather than from
the test itself.

2 For a discussion of the procedure used in the Revised (1937) Stanford-Binet, see:
Lewis M. Terman, “The Revision Procedures," in Quinn McNemar, The Revision of
the Stanford-Binet Scale, pp. 1-14. Boston: Houghton Mifflin Company, 1942.
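
The coherency check described above can be sketched in a few lines of Python. The data are invented, the function names (percent_passing, is_coherent) are hypothetical, and dividing the group into three sections of equal size is an assumption of this example; the text does not say how Terman fixed the boundaries of his dull, normal, and bright sections. The same comparison applies to the age-increase criterion if the sections are successive age groups.

```python
def percent_passing(item_results, total_scores, n_sections=3):
    """Divide examinees into sections by total score, lowest first, and
    return the percentage passing the item in each section."""
    paired = sorted(zip(total_scores, item_results))
    size = len(paired) // n_sections
    sections = [paired[i * size:(i + 1) * size] for i in range(n_sections - 1)]
    sections.append(paired[(n_sections - 1) * size:])   # last section takes any remainder
    return [100 * sum(passed for _, passed in s) / len(s) for s in sections]


def is_coherent(percents):
    """True if the percentage of successes rises from each section to the next."""
    return all(a < b for a, b in zip(percents, percents[1:]))


# Invented data: total scores of nine pupils and their results on two items
# (1 = passed, 0 = failed), listed in the same order as the totals.
totals = [12, 15, 18, 22, 25, 27, 31, 34, 38]
item_good = [0, 0, 1, 0, 1, 1, 1, 1, 1]
item_poor = [1, 0, 1, 1, 0, 0, 0, 1, 0]

for name, item in [("good item", item_good), ("poor item", item_poor)]:
    percents = percent_passing(item, totals)
    shown = ", ".join(f"{p:.0f}%" for p in percents)
    print(f"{name}: dull, normal, bright = {shown}; acceptable: {is_coherent(percents)}")
```

The first item is passed by 33, 67, and 100 per cent of the three sections and would be retained; the second, passed by 67, 33, and 33 per cent, would be rejected.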

The third criterion, world success, is the ordinary common-sense standard 
of everyday life. As the test is validated on children, this really means the 
child's world, which is primarily that of the school, his standing in which 
is reflected in his academic record. This is, of course, not a perfect criterion. 
It is not only highly subjective in character, but it throws the primary 
responsibility ultimately upon the judgment of teachers, which, because 
of its limitations, the test is being designed to replace or supplement. This 
is not so bad as it seems, however, for the basis is not that of the pupil’s 
mark on a single examination, whose notorious unreliability has already 
been described, but rather that of his entire record for an extensive period, 
a far more stable thing. Furthermore, the reliance is usually placed not 
upon the judgment of any single teacher, but rather upon the average of 
several experienced teachers. The consensus of competent persons is the 
ultimate criterion of values from the constitutionality of a law down to the 
beauty contest at the local theater. 

Individual versus group tests. It is generally assumed that an indi-
vidual test is likely to meet more fully the criteria described than does a
group test. Furthermore, the individual test permits the trained examiner 
to observe more carefully the behavior of the subject during the course of 
the examination. For example, if the subject shows signs of nervousness or 
refuses to co-operate fully, the examiner realizes that a valid measure of 
intelligence is impossible under the circumstances, and so waits for a more
opportune time. Also, if the subject is handicapped by defective vision
or hearing, this condition is likely to be discovered by the examiner, who 
then takes it into account in making his interpretations and recommenda- 
tions. For these reasons the individual intelligence test is usually taken as 
the criterion or standard for validating the group test. However, some- 
times a group test is validated by comparing the scores made on that test 
with those made by the same individuals on another group test, or possibly 
some combination of two or more group tests. For all such comparisons 
with a criterion, whether it be the Revised Stanford-Binet or some group 
test or tests, the Pearson product-moment coefficient of correlation is usu-
ally employed; r is then referred to as the validity coefficient. If the agree-
ment with the criterion is perfect, the coefficient is 1.00, and if there is no
consistent relationship at all, the coefficient is .00. Naturally the nearer the
coefficient approaches 1.00, the higher the validity is said to be, although 
in the last analysis everything depends upon the appropriateness of the 
criterion itself. Usually the most difficult step in test validation is securing 
an adequate criterion. 


TABLE 24

INTERCORRELATIONS OF INTELLIGENCE TEST SCORES AND FIVE-SEMESTER AVERAGE
GRADES FOR 284 SENIORS (124 BOYS, 160 GIRLS) IN A LOS ANGELES HIGH SCHOOL3

                                                                       SRA Primary
                          Otis       Terman-    California   SRA Non-     Mental      Average
Intelligence Test                    McNemar    Short-Form    Verbal     Abilities     Grade
                        Boys Girls  Boys Girls  Boys Girls  Boys Girls  Boys Girls  Boys Girls
Otis                      --    --   .75   .77   .73   .67   .36   .24   .57   .57   .36   .44
Terman-McNemar           .75   .77    --    --   .70   .70   .27   .36   .56   .45   .45   .52
California Short-Form    .73   .67   .70   .70    --    --   .31   .24   .54   .56   .38   .50
SRA Non-Verbal           .36   .24   .27   .36   .31   .24    --    --   .33   .40   .23   .40
SRA PMA                  .57   .57   .56   .45   .54   .56   .33   .40    --    --   .42   .44
Mean IQ                   103.7       105.5       114.2       118.2        96.4         --


Table 24 contains the intercorrelations of five intelligence tests and their 
correlations with average grades, separately by sex. It also shows the mean 
IQ on each test for the boys and girls lumped together. The Otis Self-
Administering Higher Examination, Form A, correlates with the Terman-
McNemar .75 for boys and .77 for girls, while the Terman-McNemar cor-
relates only .27 and .36 with the Science Research Associates (SRA)
Non-Verbal. Apparently the Otis, Terman-McNemar, and California Short- 
Form Test of Mental Maturity (1942 edition) are more closely related to 
each other than to the other two tests. 

3 These figures were secured from the following unpublished (mimeographed) report:
Walter G. Heil and Alice Horn, A Comparative Study of the Data for Five Different
Intelligence Tests Administered to 284 Twelfth Grade Students at South Gate High School—
Los Angeles. Los Angeles: Curriculum Division, Los Angeles City School Districts,
February, 1950. 25 pages.

The most valid test for “predicting” the five-semester average-grade
criterion is the Terman-McNemar, with coefficients of .45 for boys and
.52 for girls; the SRA Non-Verbal is least valid. Note that in each of the
five instances the r for girls is higher than for boys: .36 and .44, .45 and .52,
.38 and .50, .23 and .40, and .42 and .44. These are not genuine validity
coefficients, however. All 284 students were tested at the beginning of the
last semester of their senior year, and the average grades are for the five
previous semesters of high-school work. Thus there is no prediction here
from test scores to later school achievement, as is implied by the term
“validity coefficient.”

Three such r's are reported by Heil and Horn, though: .19 between 
intelligence-test scores of 78 persons tested in the first semester of the first 
grade and their average high-school marks; .36 for 54 individuals tested 
in the third grade; and .31 for 72 sixth graders. These validity coefficients 
are not high, but neither are some of the figures in the “average grade" 
column of Table 24, which are based upon an average interval of only 2½
semesters instead of 6-11½ years.

It is important to realize that scores on two tests may correlate perfectly 
even though the means are quite different. In other words, the size of r is
wholly independent of the difference between means, so it tells us nothing 
about how interchangeable scores on two highly correlated tests are. In 
Table 24, the mean IQ's for the same 284 students go from 96.4 for the
Primary Mental Abilities total score to 118.2 on the SRA Non-Verbal, 
a difference of 21.8 IQ points! Yet both of these tests are issued by the 
same publisher. On the other hand, the Otis and Terman-McNemar tests 
differ by only 1.8 points. 
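
The independence of r from the difference between means is easily verified. In the Python sketch below the six IQ's are invented and pearson_r is a hypothetical helper; the second "test" simply adds 21.8 points to every score, the size of the largest difference between means in Table 24.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


test_one = [88, 95, 101, 107, 112, 120]       # invented IQ's on one test
test_two = [iq + 21.8 for iq in test_one]     # same pupils, every IQ 21.8 points higher

r = pearson_r(test_one, test_two)
mean_gap = mean(test_two) - mean(test_one)

# Differences between the mean IQ's reported in Table 24.
largest_gap = round(118.2 - 96.4, 1)          # SRA Non-Verbal vs. Primary Mental Abilities
smallest_gap = round(105.5 - 103.7, 1)        # Terman-McNemar vs. Otis

print(f"r = {r:.2f}, difference between means = {mean_gap:.1f}")
print(f"Table 24: largest gap {largest_gap}, Otis and Terman-McNemar gap {smallest_gap}")
```

Here r is 1.00 although no pupil receives the same IQ on the two tests, so a high r by itself gives no license to treat the scores as interchangeable.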


II. The Validation of Achievement Tests 


Curricular versus statistical validity. In some respects the validation 
of an achievement test is more difficult than the validation of an intelli- 
gence test, and a greater number of procedures are employed for its deter- 
mination. In discussing the validation of achievement tests a distinction 
should be made between curricular validity and statistical validity. By cur- 
ricular validity is meant the extent to which the content of the test is truly 
representative of the content of the course. Curricular validity implies an 
act of judgment as to the adequacy of the sampling included in the test. 
In the earlier days this was interpreted to mean merely the extent to which 
the items of the test included a representative sampling of the essential 
materials employed in instruction. More recently, however, curricular va- 
lidity is thought of, not primarily in terms of subject matter, which at best 
is merely the stimulus, but rather in terms of the mental reactions expected 
of the pupils themselves. In other words, the center of gravity has shifted 
from the curriculum to the child. 

4 [...] “[...] Tests,” Harvard Educational [...]. [...] World Book Company as Test Service
Notebook No. 3.


Statistical validity refers to the mathematical processes for determining 
the degree to which the test agrees with, or correlates with, some criterion 
which is set up as an acceptable measure of the thing in question. Some of 
these statistical procedures aim at validating the test as a whole and others 
at validating the items individually. Although the procedures commonly 
employed by professional test makers are often rather technical, especially 
for item validation, the essential ideas are relatively simple. 

The technique employed in the preparation of the Cooperative Achieve- 
ment Tests represents an effective combination of statistical analysis and 
the judgment of experts. These tests are constructed by a trained staff 
working in close co-operation with classroom teachers, subject-matter spe- 
cialists, and test technicians. The procedure is outlined as follows:5 


a. Preliminary planning and selection of content.
     Analyses of curricula, textbooks, research studies, etc.
     Formulation of objectives and determination of general plan
     Preparation of detailed test outlines based upon survey of materials
     Submission of outlines to authorities for criticism
     Revision of test outlines in accordance with suggestion of critics
b. Preparation and editing of test items.
     Writing of items by test editors and cooperating experts
     Submission of items to authorities for criticism
     Revision of items in view of suggestions received
     Preparation of experimental forms of test
c. Administration of experimental forms to a representative sampling of students
   to obtain item difficulty and validity indices, and to detect items which may
   be weak or ambiguous.
d. Preparation of final form.
     Selection and revision of items for tentative final form
     Obtaining from experts in subject-matter fields, test technicians, etc.,
       suggestions and criticisms of the tentative final form
     Revision and final editing of the test, based on the criticisms and suggestions
       received
e. Administration of final form of test with earlier forms for equating and
   determination of scaled scores.
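
Step c of the outline above refers to item difficulty and validity indices without defining them. The Python sketch below shows one common way such indices may be computed, and it should not be read as the Cooperative Test Service's own procedure. It assumes that difficulty is the proportion of pupils answering the item correctly and that the validity index is the correlation between success on the item and total score; the responses are invented and pearson_r is a hypothetical helper.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Rows are pupils, columns are items; 1 = right, 0 = wrong (invented data).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in responses]      # each pupil's total score

difficulties, validities = [], []
for i in range(len(responses[0])):
    item = [row[i] for row in responses]
    difficulties.append(mean(item))               # proportion answering correctly
    validities.append(pearson_r(item, totals))    # item-total correlation

for i, (p, v) in enumerate(zip(difficulties, validities), start=1):
    print(f"item {i}: difficulty {p:.2f}, validity index {v:.2f}")
```

Because each item contributes to the total score with which it is correlated, indices computed in this way run somewhat high when the test is short. An item that every pupil passes or every pupil fails has no variability, and this simple helper cannot compute an index for it.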


5 Cooperative Achievement Tests for High School and College Classes, page 5. New York:
Cooperative Test Service, 1945. These tests are now distributed by the Cooperative
Test Division of Educational Testing Service, 20 Nassau Street, Princeton, New Jersey.

Perhaps attention should also be called to certain limitations of fre-
quency of mention or use as a criterion for selecting materials either for
the curriculum or for the test. In the first place, to accept what is as a cri-
terion of what ought to be leaves no room for progress. For example, someone
has defined a synonym as a word you use when you do not know how to
spell the word you want. It can scarcely be doubted that there is a wide
margin between the words actually used in ordinary speaking and writing
and those that should be used to convey best the meaning intended. In the
second place, frequency by its very nature is a poor standard for judging
importance. For example, birth and death occur but once in the life history
of an individual, and yet who would say they are for this reason less im-
portant than dressing and undressing, which occur every day? Frequency
of use, therefore, although doubtless important as one measure of social
utility, can rarely be regarded as the best criterion for validating a test.
It should usually be employed with other criteria, rather than alone.

Some criticisms of test validity. One of the commonest criticisms of
the validity of achievement tests, especially those of the objective type,
whether standardized or nonstandardized, is that they are predominantly
factual in character. It is alleged that they succeed merely in measuring
verbal memory as distinguished from genuine understanding, and leave un-
measured the really important outcomes such as discrimination, judgment,
intellectual and emotional attitudes, appreciations, and the ability to make
intelligent application of knowledge to new situations. Even the best friends
of achievement tests will readily admit that as such tests are commonly
made and used, the criticism has some merit. In fact, no one has recognized
the limitation of existing tests more clearly than some of the outstanding
leaders of the measurement movement itself. Long ago Thorndike wrote:6


In the elementary schools we now have many inadequate and even fantastic 
procedures parading behind the banner of educational science. Alleged measure- 
ments are reported and used which measure the fact in question about as well as 
the noise of the thunder measures the voltage of the lightning. To nobody are such 
more detestable than to the scientific worker with educational measurements. 


Thirteen years later Monroe7 wrote about the “child-like faith in the
efficacy of objective tests as instruments for measuring school achievement" 
on the high-school and college levels. Three examples of unwarranted beliefs 
were cited: 


I. Objectivity in scoring is an essential requirement for a satisfactory test, and 
if a test is objective, the scores yielded by it may be considered highly accurate 
measures of school achievement. 

II. If a test has been shown to be highly reliable, the scores yielded by it are
highly accurate measures of the achievement specified by its announced or implied 
function. 

III. A high correlation with a criterion is sufficient evidence to justify the use
of the scores yielded by a test as highly accurate measures of the achievement
considered to be defined by the criterion. 


While all three points are related to validity, the first two are more ap- 
propriately treated in later sections. The third point merits further dis- 
cussion here. 

6 E. L. Thorndike, “Measurement in Education,” Twenty-first Yearbook of the Na-
tional Society for the Study of Education, Part 1, page 8. Quoted by permission of the
Society. Bloomington, Illinois: Public School Publishing Company, 1922.

7 Walter S. Monroe, "Hazards in the Measurement of Achievement," School and
Society, 41: 48-52, January 12, 1935.

Validity coefficients are no exception to the general principle which holds
that all coefficients of correlation are definitely influenced by the variability
of the group. For any given type of data a heterogeneous group gives con-
sistently higher coefficients than a homogeneous group, regardless of other
considerations. This phenomenon alone makes it impossible to regard the
validity coefficient as absolutely fixed in magnitude. For example, a co-
efficient of .60 for a single grade may actually be more significant than one
of .90 for several grades thrown together. Monroe points out that such
coefficients also vary with the “general plan of the curriculum and the
objectives toward which the instruction is directed." And, of course, the
value of the validity coefficient always depends ultimately upon the ade-
quacy of the criterion itself. After a test has been once used, it tends to
influence both the teaching of the instructor and the study of the students,
so that “a given type of objective test is likely to become less and less valid
as its use is continued.” In view of these facts most studies comparing the
validity of one test with that of another are inconclusive, especially so when
based on different groups. Comparisons of the relative validity of essay
and objective tests and of various forms of objective tests are still more
risky, for it is usually impossible to tell whether all forms were equally ap-
propriate for the objectives in question, were constructed with equal skill,
or aroused equal degrees of motivation. Such considerations lend support to
Monroe’s conclusion that the “coefficient of validity calculated for a test
is a statistic of uncertain value.” One thing is certain: Validity coefficients
can rarely be taken at face value.
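
The influence of the variability of the group on the size of r can be shown with artificial data. The Python sketch below rests entirely on assumptions chosen for illustration: pupils' scores are simulated, both the test and the criterion are supposed to rise about 10 points from each grade to the next, and simulate_grade and pearson_r are hypothetical helpers.

```python
import random
from statistics import mean

random.seed(2)          # fixed seed so that the run is repeatable


def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_grade(grade, n=300):
    """Invented pupils: ability centers on 10 points per grade, and the test
    and the criterion each measure it with independent chance error."""
    test, criterion = [], []
    for _ in range(n):
        ability = random.gauss(10 * grade, 5)
        test.append(ability + random.gauss(0, 4))
        criterion.append(ability + random.gauss(0, 4))
    return test, criterion


# One homogeneous group: a single grade.
single_test, single_criterion = simulate_grade(6)
r_single = pearson_r(single_test, single_criterion)

# One heterogeneous group: grades 4 through 8 thrown together.
pooled_test, pooled_criterion = [], []
for grade in range(4, 9):
    t, c = simulate_grade(grade)
    pooled_test += t
    pooled_criterion += c
r_pooled = pearson_r(pooled_test, pooled_criterion)

print(f"single grade: r = {r_single:.2f}")
print(f"grades 4-8 together: r = {r_pooled:.2f}")
```

Under these assumptions the expected coefficients are about .61 for the single grade and about .93 for the five grades together, although the test measures every pupil in exactly the same way in both cases.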

Tyler's suggestions. The upshot of the above discussion is that in 
judging the validity of an achievement test, for the present at least, major 
dependence must be placed, not on the statistical analysis of test results, but
on the logical and psychological analysis of test construction. On this point 
perhaps no criticism has been more influential than that of Ralph W. Tyler. 

The Tyler technique of test construction is based upon a broad concep- 
tion of validity. He regards a valid test as one which affords satisfactory 
evidence of the degree to which the students are actually reaching the 
desired objectives of teaching, these objectives being specifically stated in 
terms of the kinds of behavior expected in the students. Tyler summarizes 
the process as follows:8 


All methods of measuring human behavior involve four technical problems: 
(1) defining the behavior to be evaluated, (2) selecting the test situations, or 
determining the situations in which the behavior is expressed, (3) developing a 
record of the behavior that takes place in these situations, and (4) evaluating the 
behavior recorded. Regardless of the type of appraisal under consideration, whether 
it be the observation of children at play, the written examination, the techniques 
of the psychological laboratory, the questionnaire, or the personal interview, these 
problems are encountered. The choice of the methods of measurement rests primarily 
upon the effectiveness with which the methods solve these problems in the 
particular case under consideration. 


8 Thirty-Fourth Yearbook of the National Society for the Study of Education, page 114. 
Quoted by permission of the Society. Bloomington, Illinois: Public School Publishing 
Company, 1935. 


A SATISFACTORY MEASURING INSTRUMENT 115 


A few examples may be helpful. In chemistry the objectives sought by 
the instructors would doubtless include such abilities as understanding technical 
terms, remembering important facts and principles, applying important 
chemical principles to concrete situations, expressing chemical relationships 
by appropriate equations, and acquiring certain laboratory skills. 
In like manner, the objectives in English might include the effective use of 
English in speaking and writing, acquaintance with certain literary master- 
pieces, critical skill in evaluating the major types of literary productions, 
and the appreciation of good literature. These illustrations are sufficient to 
make it clear that no one type of test situation can measure adequately 
such a variety of teaching objectives. However, teachers have often as- 
sumed that the attainment of one objective, usually knowledge, is sufficient 
assurance of like accomplishments in the others. Tyler has shown that this 
assumption is quite as erroneous as that made by Cattell and Galton, who 
in the pre-Binet stage of mental measurement accepted sensory discrimina- 
tion as indicative of the higher mental processes, such as judgment and 
reasoning. He found, for example, that in elementary biology the correla- 
tion between information and the ability to apply principles was .40, that 
the correlation between information and the ability to interpret experi- 
ments was .41, and that the correlation between information and skill with 
the microscope was only .02. Even when allowance is made for the homo- 
geneity of the groups involved, the relationship is far from close. 

Direct versus indirect methods. It has been found over and over 
again that the way to attain an educational objective in teaching is to aim 
at it directly rather than to rely upon transfer of training to bring it about 
indirectly. Apparently the same thing is true of testing also; the way to 
measure the extent to which an educational objective has been realized is 
to aim at it directly wherever possible, rather than to infer its existence 
from the indirect measurement of something else. 

But this principle has often been violated in educational practice. We 
have been content with indirect measurement when we might have had 
direct. The Ayres educational index, for example, was based, not directly 
upon the educational achievement of the several states, but indirectly upon 
the educational opportunities offered, as measured by money expended, 
length of school term, and the like. Even today the important state and 
regional accrediting associations of colleges and secondary schools do not 
as a rule aim directly at the product of these schools, in terms of desired 
changes in pupil behavior, but rather indirectly at the facilities offered, 
such as size of classes, degrees held by members of the faculty, and number 
of volumes in the library. In view of the low correlation between intelligence 
and school grades, which averages less than .50,9 and that between 

9 Such r’s are attenuated (made lower than they would otherwise be) by the considerable 
unreliability of teachers’ grades. Correlations between intelligence and achievement 


test scores are usually much higher. As far back as 1927 Kelley concluded that the 
underlying communality of intelligence and achievement is about 90%. See Truman L. 




the possession of knowledge and the ability to use it, which averages even 
less, one would hardly expect to find the perfect correlation between educational 
facilities and educational performance that such practice appears 
to assume. It is doubtless still true that there are Mark Hopkinses capable 
of transforming mere logs into colleges, while marble palaces may remain 
but piles of stone for lack of such a magic touch. In any case, it is safer to 
examine what is happening to the student at his end of the log, than to 
remain content to measure the dimensions of the log, or even the creden- 
tials of the individual who happens to be at the other end. 

Evidence concerning the value of indirect measures is conflicting. For 
example, in a study involving a test administered to 300,000 young men 
which measured both intelligence and general achievement, Davenport and 
Remmers10 found rather high r’s between the test means for states and 
such state characteristics as telephones per thousand persons (.83), per 
capita income (.81), value of school property (.76), and Negroes per thousand 
persons (-.70). They located and named “state economic,” “rural-urban,” 
and “deep-South versus non-South” factors which seemed to account 
for most of the correlations found. Their conclusion is:11 

These data are all state data; they do not apply to individuals. Without much 
facetiousness, however, we interpret these results to mean that the probabilities of 
reaching a high educational achievement are much greater if one comes from a high 
income state which is highly urban, which is not in the South, and which has such 
advantages as library service available to most of its population, has a high pro- 
portion of foreign-born citizens, a large number of residents in Who’s Who, and 
many telephones. 


Using achievement and intelligence test results from 154 communities, 
large and small, throughout the United States, Thorndike12 found 24 community 
variables to be much more highly correlated with intelligence than 
with achievement. In fact, the estimated maximum correlation of these 
community aspects (population in thousands, per cent native-born white, 
and so forth) with achievement-test means was only about .30. Thorndike 
offers several possible explanations of this low relationship. 

It may, of course, be expedient at times to rely upon indirect evidence, 
but at any rate, one should do so only where direct evidence is not available 
and even then with full realization of the risks involved. For example, up 
to the present time test makers have found it difficult to devise suitable 
instruments for measuring such intangible outcomes of teaching as atti- 


Kelley, Interpretation of Educational Measurements, page 196. Yonkers-on-Hudson, 
New York: World Book Company, 1927. 

10 K. B. Davenport and Hermann H. Remmers, “Factors in State Characteristics Related 
to Average A-12 V-12 Test Scores,” Journal of Educational Psychology, 41: 110-115, 
February, 1950. 

11 Ibid., page 115. 

12 Robert L. Thorndike, “Community Variables As Predictors of Intelligence and 
Academic Achievement,” Journal of Educational Psychology, 42: 321-338, October, 
1951; “Note of Correction,” 43: 179-180, March, 1952. 




tudes, appreciations, and interests. But it is probably true that the better 
standard tests come far closer to measuring the objectives actually attained 
or aimed at in educational practice than they come to measuring those 
suggested as desirable in educational theory. It is apparently just as difficult 
to teach these intangible things as it is to test them. It seems reasonable to 
think that it is no less difficult to provide the appropriate teaching materials 
for bringing about the right kind of attitudes, appreciations, and interests 
than it is to provide the appropriate testing materials for determining how 
well the job is being done. It should be kept in mind that a valid test con- 
sists largely of a representative sampling of the materials that make up the 
course. It should help to clarify the atmosphere once and for all to recognize 
frankly that the less tangible outcomes are harder to teach and to test than the 
more tangible outcomes. And it may be that for some time to come we shall 
have to be content to aim at both indirectly. 

But this is no permanent solution to the problem. One of the important 
services the measurement movement can render education is the clarifica- 
tion of its objectives. The necessity for this should be apparent both to the 
curriculum maker and to the test maker. Considerable progress has already 
been made in this direction, and more will doubtless be forthcoming. The 
pioneering work of Wrightstone13 is an illustration. He reports a series of 
tests in the social studies with such a diversity of aims as the interpretation 
of facts, the making of generalizations, the organization of data, several 
important work-study skills, and certain civic attitudes and beliefs. Reports 
of the Eight-Year Study of the Progressive Education Association indicate 
substantial progress in this direction.14 

Item analysis. Specialists in test construction not only attempt to 
validate the test as a whole against some outside criterion, but also to vali- 
date the items on the test individually, usually against an inside criterion, 
the test as a whole. Frequently an outside criterion would be better, but 
it is often not available. Although many of the processes for item analysis 
are rather technical and complicated, the essential idea is easy to grasp: 
The purpose is to determine the difficulty and the discriminating value of 
each item in the test. Obviously an item missed by everybody or answered 
correctly by everybody who took the test is of no value in differentiating 
between good and poor pupils. If the test is for the purpose of determining 
the extent to which the minimum essentials of a unit or of a course have 
been mastered, however, the difficulty of the individual items is relatively 
unimportant and the matter of discrimination is of minor significance. But 
if the test is to be used over several grades as a basis of classification or 

13 J. Wayne Wrightstone, “Measuring Some Major Objectives of the Social Studies,” 
School Review, 43: 771-779, December, 1935; also J. Wayne Wrightstone, Appraisal of 
Experimental High School Practices, 194 pages. New York: Bureau of Publications, 
Teachers College, Columbia University, 1936. 


14 Eugene R. Smith, Ralph W. Tyler, and staff, Appraising and Recording Student 
Progress, 550 pages. New York: Harper and Brothers, 1942. 




school marks, the discriminating value of the items is of major importance. 
With the exception of a few easy items at the beginning of such a test for 
the purpose of building morale in the pupils taking it, the items should 
show a percentage of successes increasing progressively from the poorest 
pupils to the best. 

Only the simpler processes need concern the classroom teacher, for there 
is considerable doubt whether the elaborate techniques are enough better 
than the simpler ones to justify the additional labor involved. 

One of the writers has devised a method of item analysis that is simple 
enough to be performed by a reasonably conscientious high school student. 
This procedure is explained fully in Appendix B, pages 436-453, where a 
typical classroom test is analyzed. The method can be outlined as follows: 


1. Administer the test and score the papers, preferably putting a red X 
beside each incorrectly answered or omitted question. 

2. On the basis of each student's total score, find the 27 per cent of the 
persons tested who scored highest (had the fewest X’s). Call this the “high” 
group. Find the 27 per cent who scored lowest (had the most X’s) and call 
this the “low” group. For instance, if there were 60 persons tested, the 
number in the high group would be 0.27 × 60 = 16.2, which rounds to 16. The number 
in the low group would also be 16. This would leave in the “middle” group 
60 - (16 + 16) = 28 papers to be put aside, for they are not needed in 
the item analysis. 

3. Start with Item 1 on the test. How many persons in the low group 
missed it? How many persons in the high group? Subtract the number of 
persons in the high group who missed the item from the number in the low 
group who missed it. 

4. Repeat Step 3 for every question in the test, each time determining 
the difference between the number of students in the low group who missed 
the item and the number in the high group who missed it. 

5. List the differences in ascending order, beginning with the most 
negative one and going down to the largest positive one, each together with the 
item’s number in the test. There should be as many differences as there are 
test questions. The largest positive differences (near the bottom of the list) 
indicate the most discriminating items, while the small positive differences 
and the negative differences suggest that those items are not discriminating 
properly and should therefore be looked over carefully for vagueness, im- 
proper keying, or unattractive options. 

For the complete process, see Appendix B. 
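
The five steps outlined above lend themselves to a short program. The following Python sketch is an illustration only and is not the worked procedure of Appendix B. The function names group_size and item_discrimination, and the layout of the data (one list per paper, with True where the item was answered correctly and False where it was missed or omitted), are assumptions made for the example.

```python
def group_size(n_students, fraction=0.27):
    """Number of papers in the high group (and in the low group), as in Step 2."""
    return round(fraction * n_students)


def item_discrimination(responses, fraction=0.27):
    """Return (difference, item_number) pairs in ascending order of difference.

    responses: one list per paper; responses[s][i] is True if student s
    answered item i correctly, False if the item was missed or omitted.
    The difference for an item is (number in the low group who missed it)
    minus (number in the high group who missed it), as in Steps 3 and 4.
    """
    size = group_size(len(responses), fraction)
    if size < 1:
        raise ValueError("too few papers to form high and low groups")

    # Step 2: rank the papers by total score, best first.
    # Ties at the cut-off are broken by original order (an arbitrary choice).
    ranked = sorted(responses, key=sum, reverse=True)
    high = ranked[:size]
    low = ranked[-size:]

    differences = []
    for item in range(len(responses[0])):
        low_missed = sum(1 for paper in low if not paper[item])
        high_missed = sum(1 for paper in high if not paper[item])
        differences.append((low_missed - high_missed, item + 1))

    # Step 5: ascending order, most negative difference first.
    differences.sort()
    return differences
```

With 60 papers, group_size(60) gives 16, in agreement with the example in Step 2. Items appearing at the top of the returned list (negative or small positive differences) are the ones to look over for vagueness, improper keying, or unattractive options.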


The chief value of item analysis to the teacher is that it helps make better 
15 Julian C. Stanley, “A Simplified Item-Analysis Procedure,” American Psychologist, 


6: 369, July, 1951. Abstract of paper read at the American Psychological Association 
convention in Chicago on September 1, 1951. 




tests by pointing out unsuspected flaws in items that would otherwise 
probably continue to appear in these and later questions. 

Frequently, perhaps generally, it will be found that the trouble is in the 
wording of the item, the language being vague, ambiguous, or positively 
misleading. In that case a rewording of the item may be all that is neces- 
sary. At times, however, the difficulty is more obscure, and the item may 
have to be eliminated altogether. Lindquist found that “adequately” and 
“advisers” were equally difficult for eighth-grade spellers, but that the 
former discriminated in favor of the good spellers and the latter in favor 
of the poor spellers. Difficulty alone, therefore, is not a dependable measure 
of discrimination, for according to that criterion both items are equally 
good. Test experts have usually found, however, that the average difficulty 
of the items in a test is related to the adequacy of the test as a whole. The 
rule suggested for the construction of tests to discriminate best among all 
the members of a group is to make every item of 50 per cent difficulty when 
corrected for chance, so far as possible. This will mean that virtually all 
items of 0-15 per cent and 85-100 per cent difficulty when corrected for 
chance will be omitted from the revised form of the test, unless they can be 
rewritten to make them closer to the 50 per cent difficulty level. 

Tests designed primarily for instructional purposes, however, may at 
times be made much easier with good results. 

Judging the validity of standard tests. It is always desirable to 
examine with some care the content of a standard test before deciding to 
use it. Some of the earlier tests in particular contained serious errors. 
Upton17 called attention to some of these in arithmetic tests, and Diamond18 
found 318 errors in 3,303 items making up the content of sixteen 
widely used tests in biology and general science. Only one test was found 
to be entirely free from error. A study of five tests of English usage revealed 
that from 16 to 55 per cent of the items called wrong were actually accept- 
able according to standards published by the National Council of Teachers 
of English.19 Even when there are no errors, the items used often stress the 
relatively unimportant aspects of the subject. 

The test manual also should be examined, because it frequently gives 
data on the validity of the test; for example, it should tell who made the 
test, how the items were chosen, what standardization and validation pro- 
cedure was followed, and other pertinent information. If the author does 


16 Correcting test scores for “guessing” is discussed on page 156. Items are corrected 
in a similar manner, as explained in footnote 32 on page 160. 

17 Clifford B. Upton, “The Influence of Standardized Tests on the Curriculum of 
Arithmetic,” Teachers College Record, 26: 627-641, April, 1925. 

18 Leon N. Diamond, “Testing the Test-Makers,” School Science and Mathematics, 
32: 490-502, May, 1932. 

19 Karl W. Dykema, “On the Validity of Standardized Tests of English Usage,” 
School and Society, 50: 767, December 9, 1939. 




not give such information to the prospective users, it is safe to assume that 
the test is of doubtful validity, for it is evident that the author does not 
attach as much importance to the matter as is desirable.20 While it is unnecessary 
to ascribe improper motives to test authors and publishers, most 
of whom are of a very high type, it is important to recognize that they are 
nevertheless human, and it is reasonable to make some allowance for a 
little overenthusiasm about the merits of their own progeny. Whenever 
available, therefore, the results reported in the professional literature by 
other users are likely to be especially valuable. 

A systematic attempt to provide the information required by the test 
user as a basis for an intelligent choice of tests has been made by Buros.21 
In a series of Mental Measurements Yearbooks he plans to make available 
critical evaluation of all recent tests by one or more competent persons 
independently. These publications will be found indispensable in selecting 
tests. The reviewer is instructed to make the reviews “frankly critical" and 
“to base the appraisals upon his own criteria as to what constitutes a good 
test." The reviewers are described as follows: 


In selecting reviewers an effort was made to choose persons representing a wide 
variety of positions and viewpoints among actual and potential test users. As a 
result, a very heterogeneous group of reviewers have cooperated in the preparation 
of this volume—classroom teachers, city school research workers, clinical psy- 
chologists, curriculum specialists, guidance specialists, personnel workers, psycholo- 
gists, subject-matter specialists, and test technicians. . . . It can be truly said 
that the reviewers represent no one group or school of thought, unless the reviewers 
are described as representing all test users—actual and potential—who are con- 
sidered especially competent in their fields and who have the courage to speak 
frankly and honestly in appraising a standard test. . . . 

Ideally, a reviewer of a standard test, such as a high school Latin test, ought to 
possess the qualifications of a curriculum and teaching specialist, and a test tech- 
nician. Unfortunately, all of these qualifications are rarely found in any one 
person. . . . The average quality of the reviews is likely to be highest when the 


20 The American Psychological Association and other similar professional organiza- 
tions have become concerned with the quality of test manuals and the methods of 
distributing tests. With regard to “interest inventories, personality inventories, projec- 
tive instruments and related clinical techniques, and tests of aptitude or ability,” see 
“Technical Recommendations for Psychological Tests and Diagnostic Techniques,” 
Psychological Bulletin, 51: 1-38, March, 1954. 

A more general source is “Information Which Should Be Provided by Test Publishers 
and Testing Agencies on the Validity and Use of Their Tests,” papers read by Herbert 
S. Conrad, Paul L. Dressel, and Laurance F. Shaffer, Proceedings of the 1949 Invitational 
Conference on Testing Problems, pages 63-90. Princeton, New Jersey: Educational Test- 
ing Service, 1950. 

21 Oscar K. Buros, The Fourth Mental Measurements Yearbook. Highland Park, New 
Jersey: Gryphon Press, 1953. 1189 pages. This yearbook covers the years 1948 through 
1951 and therefore supplements rather than supplants the older yearbooks. 

For bibliographies and discussions of tests during the period August 1, 1949, through 
July 31, 1952, see Frederick B. Davis (Editor), “Educational and Psychological Test- 
ing,” Review of Educational Research, 23: 1-110, February, 1953. 




reviewers discuss only those points which they feel most competent to appraise, 
even though this practice frequently results in reviews which are not comprehensive. 


The group judgment of even the most competent persons has certain 
limitations. There remains the troublesome fact that no one test is equally 
valid for all purposes, or for the same purpose in all situations. Further- 
more, there is no way of knowing when a new test may appear with merits 
so outstanding as to render obsolete earlier tests that have hitherto been 
entitled to high comparative ratings. But, all things considered, the best 
available sources of information are probably the measurement specialists 
in reputable colleges and universities, of which there are several in most 
states. Their recommendations can usually be relied upon to be impartial 
and based upon a wider acquaintance with existing tests than the average 
teacher or school administrator is likely to have. But in the final analysis, 
when all the cards are on the table, the teacher or administrator must rely 
upon his own judgment. The necessary background for making such a 
judgment intelligently should be specifically provided for in the professional 
training of teachers. The data required for such judgments should be made 
available by the test publishers and by such publications as the Mental 
Measurements Yearbooks. 


C. Reliability 


Meaning of reliability. By reliability is meant the degree to which 
the test agrees with itself. To what extent can two or more forms of the test 
be relied upon to give the same results, or the same test to give the same 
results when repeated? If the scores on the test are stable under these conditions, 
the test is said to be reliable. In a word, reliability means consistency.23 

The terms reliability and validity are often confused, but there is a clear- 
cut distinction between them. Reliability, as such, has nothing to do with 
the truthfulness of the measurement, but is concerned only with its con- 
sistency, an entirely different thing. A homely illustration may help to 
clarify the distinction. A man returns from his vacation with a picturesque 
story of the fish he claims to have caught. As he meets friend after friend, 
there is always the same glowing account, even to the minutest detail. 
Now, in a statistical sense the story is reliable, for it is certainly consistent. 
Unfortunately, the fisherman’s veracity is not thereby established, for con- 
sistency by itself gives no assurance of truthfulness or validity. In reality 
the story might be sheer fiction from beginning to end. 

Importance of reliability. Shakespeare said: “Consistency, thou art 
a jewel,” and he was right. But consistency is not the greatest jewel, 


22 Oscar K. Buros, The Nineteen Forty Mental Measurements Yearbook, pages 12-13. 
Highland Park, N. J.: Gryphon Press, 1941. 

23 For a thorough discussion of this concept, see Robert L. Thorndike, “Reliability,” 
in E. F. Lindquist (Editor), Educational Measurement, pages 560-620. Washington, 
D. C.: American Council on Education, 1951. 




whether in a test or elsewhere. By itself consistency, or reliability, is a 
doubtful virtue, for a test, as well as a person, might be consistently wrong, 
but its absence is a sign of weakness. Although high reliability is no guar- 
antee that the test is good, low reliability does indicate that it is poor. 
In the above illustration it should be noted that had marked discrepancies 
occurred in the fisherman’s story from time to time, considerable doubt 
would have been cast upon his truthfulness. Validity is always the first 
quality to be sought in a test, and, granted that, reliability is a valuable 
auxiliary. The ideal test tells the truth consistently. 

There can be little question that test makers have given too much at- 
tention, relatively, to determining the reliability of tests, and too little to 
establishing their validity. One reason for this, doubtless, is that the former 
is easier to determine. Much harm has resulted, however, when uncritical 
users have naively assumed that reliability insures validity, a view which 
is wholly erroneous. 

Methods of determining reliability. The term reliability is purely a 
statistical concept. Contrary to what was found in the case of curricular 
validity, little can be told about the reliability of a test from examining 
the test blank itself. It is, of course, true that if a test can be objectively 
scored, it is more likely to be reliable than if the scoring is subjective; but 
the degree of reliability cannot be determined by that fact. It is also true 
that a long test has a greater likelihood of being reliable than a short test; 
but there are many exceptions. In the last analysis, however, somebody 
must try the test out to determine its reliability. Usually the author of the test 
does this and reports the results in the test manual. If such is not the case, 
one has a right to be suspicious of the merits of the test. 

Method with two test forms. Three rather distinct techniques are 
used to establish the reliability of a test. The method commonly used by 
makers of standard tests is to prepare two or more parallel forms of the test, 
and then to give these equivalent forms of the test to a large number of 
pupils, usually with only a short interval between the tests. The test is said 
to be reliable if there is close agreement between the scores on the two 
forms; that is, if the pupils who made high scores on the first test also make 
high scores on the second; if those who made low scores on the first test 
again make low scores on the second; and so on for all ranks in between. 
If the agreement is perfect, as is most unlikely, the correlation is 1.00. 
On the other hand, if there is no consistent relationship, the coefficient of 
reliability is .00. It will be recalled that validity is also expressed as a co- 
efficient of correlation whose maximum value is 1.00, and whose minimum 
value is .00. But in the case of statistical validity the agreement is with an 
external criterion, whereas in the case of reliability the agreement is with 
an internal criterion of some kind. In the above illustration this internal 
criterion is another form of the same test, which presumably measures the 
same functions as the first test. 




Methods with one test form. When only one form of a test is avail- 
able, probably its reliability can still be determined. One procedure is to 
repeat the test at a later time and to determine the extent of agreement 
by computing the coefficient of correlation between the two series of scores. 
Another procedure is to give the test once only and then to record two 
scores for each paper, one for each half. The test is split into halves matched 
for content and difficulty. When the two series of scores are obtained, the 
coefficient of correlation between them is computed. This is the reliability 
of the half test. The reliability of the whole test is then estimated by the 
use of a special formula.24 A similar formula also makes it possible to estimate 
the probable reliability of the test when increased to any required 
length, assuming that the items added are of the same type and quality 
as those in the original test. 
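
The split-half computation just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a procedure taken from the text: the function names are invented for the example, each paper is assumed to be a list of item scores (1 for right, 0 for wrong), and the halves are formed from alternate items, which is only one of several ways of getting halves matched for content and difficulty. The general step-up function uses the widely published form of the Spearman-Brown formula for a test lengthened by any factor; with a factor of 2 it reduces to twice the half-test r divided by (1 + the half-test r).

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two series of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, length_factor=2.0):
    """Estimated reliability of a test made length_factor times as long.

    With length_factor = 2 this is 2r / (1 + r), the half-test case.
    """
    return length_factor * r / (1 + (length_factor - 1) * r)


def split_half_reliability(item_scores):
    """Return (half-test r, estimated whole-test reliability).

    item_scores: one list of item scores per paper.
    Assumption for this sketch: alternate items form the two halves.
    """
    first_half = [sum(paper[0::2]) for paper in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(paper[1::2]) for paper in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(first_half, second_half)
    return r_half, spearman_brown(r_half)
```

For instance, a half-test correlation of .60 steps up to an estimated whole-test reliability of .75. The estimate rests on the assumption stated in the text, that added items are of the same type and quality as the original ones.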

Both methods for obtaining the reliability of a single form of the test have 
been severely criticized and as stoutly defended. The test-retest method 
has certain serious limitations. If the test is long, to avoid fatigue and boredom 
some time must elapse between the two trials. In the case of achievement 
tests, particularly, this delay is likely to introduce other variables. 
The pupils may discuss the test between trials, do extra study, or do other 
things that may effect a change in the status of their knowledge. In addition 
to this, their physical and mental conditions fluctuate from day to day, 
even from hour to hour. For example, Ashbaugh25 found variability in one 
fourth of the pupils who were given the same spelling test under highly 
constant conditions three times within fifteen minutes. One would appre- 
ciate the difficulty in determining the reliability of a certain type of ther- 
mometer by checking the readings made at one hour against those made 
later in the day. Guilford thinks that “it is safe to say that the average 
test scale of mental ability is fully as reliable, or probably more so, than the 
average clinical test in medicine, such as the test of blood pressure or the 
basal metabolism test, whose reliability ranges from about .60 to .90.”26 
But there is also the contrary tendency in human beings for errors made 
the first time to persist and to be repeated at later times. In an extreme 
situation, where the pupils memorized the first series of answers, the ap- 
parent reliability of the test would be perfect. Indeed, this tendency to echo 
the original responses appears to be strong, for test-retest coefficients are 
usually higher than those arrived at by correlating halves or equivalent 
forms of the test. Because the correlation of the half-tests eliminates, or 


24 This is the simplest form of the Spearman-Brown “prophecy” or “step-up” formula, 
which appears in many test manuals: the reliability of the whole test equals twice the 
r between the half-test scores, divided by (1 + the r between half-test scores), being 
written in symbols as $r_{tt} = \dfrac{2r_{hh}}{1 + r_{hh}}$, where $r_{hh}$ is the r between 
the half-test scores and $r_{tt}$ is the reliability of the whole test. 

25 Ernest J. Ashbaugh, “Variability of Children in Spelling,” School and Society, 
9: 93-98, January 18, 1919. 

26 J. P. Guilford, “Intelligence Tests,” Education, 58: 528, May, 1938. 




at any rate greatly reduces, memory carryover, it is recommended by some 
writers. 

The split-half procedure, involving as it does the Spearman-Brown for- 
mula, is based upon certain assumptions. The half-tests should be of equal 
variability, and the items in one half must be of the same quality as those 
in the other half. It must be emphasized that the formula requires the use 
of matched halves of the test, not just any halves.27 

Kuder and Richardson28 have devised several methods of obtaining a 
reliability coefficient which make it unnecessary to split the test into halves 
or calculate a coefficient of correlation. Unfortunately, their most usable 
formula, called KR No. 20, is very laborious for the classroom teacher to 
compute. 
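The labor in KR No. 20 lies mainly in tallying the proportion passing each item, which is easily mechanized. The following is a minimal sketch, assuming the formula as it is usually stated (it is not given in the text above): the number of items k over k - 1, times one minus the sum of pq over the variance of the total scores, where p is the proportion passing an item and q = 1 - p. The use of the variance computed with N in the denominator, and the invented marks, are assumptions of the sketch.

```python
def kr20(item_marks):
    """Kuder-Richardson formula 20 for rows of 0/1 item marks, one row per pupil."""
    n_pupils = len(item_marks)
    k = len(item_marks[0])  # number of items
    totals = [sum(row) for row in item_marks]
    mean_total = sum(totals) / n_pupils
    # Assumption: variance of total scores computed with N, not N - 1.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_pupils
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_marks) / n_pupils  # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical marks for five pupils on a six-item test (invented for illustration).
marks = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
print(round(kr20(marks), 2))
```

No splitting into halves and no coefficient of correlation is required, which is the advantage claimed for the method.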

As Cronbach²⁹ aptly points out, the comparable-forms, split-half, and
test-retest reliability coefficients get at different aspects of reliability. The 
first is a "coefficient of equivalence and stability," the second a “coefficient 
of equivalence" only, and the third a “coefficient of stability.” 

A strong word of caution is needed. Neither the split-half method nor the
various Kuder-Richardson formulas is applicable to “speeded” tests, for both
overestimate the reliability of such tests. A test is speeded when many of the examinees
would have made better scores had they been given more time. If nearly 
all persons did about as well in the time allowed as they could have done 
in a longer period, then the measuring instrument is a “power” test. In 
practice, most timed tests involve both speed and power. Watch for the 
rather frequent split-half or KR reliability coefficients reported in test 
manuals for speeded tests; they are deceptively high. 

One of the writers has published a simplified computational technique 
for securing split-half reliability coefficients from unspeeded tests.³⁰ Another
shortened procedure is illustrated in Appendix B on pages 452-453. 

The interpretation of test reliability. What standard shall a test 
meet in order to be considered satisfactory from the standpoint of reli- 
ability? No simple answer to this question is possible. It depends, for one 
thing, upon the fineness of discrimination required. Kelley³¹ has suggested

27 However, Clark has shown that the variation of split-half r’s from sample to sample 
is much greater than the variation of such r’s within a sample due to different methods 
of splitting, provided that the method of splitting is longitudinal—puts items through- 
out the test in each half, rather than having one half consist of the first items and the 
other half of the last. See Edward L. Clark, “Methods of Splitting vs. Samples as 
Sources of Instability in Test-Reliability Coefficients,” Harvard Educational Review,
19: 178-182, May, 1949. 

28 G. F. Kuder and M. W. Richardson, “The Theory of the Estimation of Test Relia- 
bility," Psychometrika, 2: 151-160, September, 1937. 

29 Lee J. Cronbach, Essentials of Psychological Testing, pages 65-73. New York:
Harper & Brothers, 1949.

30 Julian C. Stanley, “A Simplified Method for Estimating the Split-Half Reliability
Coefficient of a Test,” Harvard Educational Review, 21: 221-224, Fall, 1951.

31 Truman Lee Kelley, op. cit., pages 28-29.




the following minimal requirements for the reliability coefficients of a single 
school grade: 

.50 for determining the status of a group in some subject or group of subjects. 

.90 for differentiating the achievement of a group in two or more scholastic lines. 

.94 for differentiating the status of individuals in the same subject or group of
subjects. 

.98 for differentiating individuals in two or more scholastic lines. 
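Kelley's minimums can be applied mechanically to any reported coefficient. The following is a minimal sketch; the third figure is taken to be .94, the value usually quoted from Kelley, and the shortened labels and the function name are illustrative only.

```python
# Kelley's suggested minimum reliability coefficients for a single school grade.
# The .94 entry is an assumption based on the figure usually quoted from Kelley.
KELLEY_MINIMUMS = [
    (0.50, "status of a group"),
    (0.90, "group differences in two or more scholastic lines"),
    (0.94, "status of individuals"),
    (0.98, "individual differences in two or more scholastic lines"),
]

def uses_supported(reliability):
    """Return the uses for which the coefficient meets Kelley's minimum."""
    return [use for minimum, use in KELLEY_MINIMUMS if reliability >= minimum]

print(uses_supported(0.92))
```

A test with a coefficient of .92 would thus pass for both group purposes but fall short of the standard for judging individual pupils.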

The interpretation is also beset by many other difficulties. The coefficients 
not only reflect somewhat the methods employed in their computation, 
but also the variability of the groups, the interval between tests, and other 
factors. For example, a test of average difficulty for a typical group may 
be much less reliable when used with a markedly inferior or markedly 
superior group.

In view of the fact that measures of reliability, no matter how arrived at, 
are influenced by factors other than the form and content of the test itself, 
it would appear that the value of such measures has been overemphasized.³²
The same energy devoted to improving the validity of the test would bring 
better returns. It is not likely that the average teacher will find it profitable 
to compute reliability coefficients for ordinary class tests, although it may 
sometimes be worth while to do so for final examinations. 

Objectivity and reliability. By objectivity in a measuring instrument 
is meant the degree to which equally competent users get the same results. 
Ordinary measures of height and weight, for example, are objective, while 
estimates of beauty and integrity are subjective. The distinction between 
objective measurement and subjective measurement is implied in the
question: “Do married men really live longer than single men, or does it just
seem longer?" As a rule, objectivity is very closely associated with reli- 
ability. For this reason standard tests are usually more reliable than rating 
scales. As a matter of fact, great impetus was imparted to the objective 
test movement by the discovery that the major cause of the notorious un- 
reliability of the ordinary school examination was its subjectivity of mark- 
ing. The emphasis on objectivity has since gone so far, however, that many 
educational workers seem to regard “objectivity” as synonymous with 
“scientific method.” To such persons, any element of subjectivity in a study
renders it hopelessly unscientific. It may be well, therefore, to look care- 
fully at this all-important matter of objectivity. 

To discover at the outset that there is no such thing as a wholly objective 
measure may be something of a shock. The plain fact is that objectivity 
is always relative, never absolute. The measurements obtained by a yard- 
32 Some writers would abandon altogether the “blanket term reliability” in favor of
more specific estimates of absolute and relative accuracy of measurement. Robert W. B.
Jackson and George A. Ferguson, page 25 in “Studies on the Reliability of Tests,”
Bulletin No. 12 of the Department of Educational Research, University of Toronto, 371
Bloor Street West, Toronto 5, Ontario, Canada, 1941.




stick, for example, are only relatively objective, for one would hardly expect 
a dozen different persons to get absolutely the same results in measuring 
the length of the playground. They would probably agree to the nearest 
foot, and possibly to the nearest inch, but they would usually disagree 
markedly, if the results were expressed in some such small unit as hun- 
dredths of an inch. And, of course, such units as inch, foot, and yard are not 
natural units, like day and year, but units set up by human judgment. 
Brownell points out that there are always many subjective factors involved
even when the test used is of the so-called objective type. He says:³³


Well, first of all, in the practical circumstances of teaching, one decides to give a 
test. The decision is surely not based upon purely objective considerations. Second, 
one determines whether to make a test or to buy one. . . . Third, one makes up 
one's mind regarding the kind of test—whether it is to be of the traditional type, 
of the newer types, or a combination—judgment again. Fourth, one settles upon
the scope of the test—judgment once more. Fifth, one selects the items to be
included—little objectivity here. Sixth, one chooses the form to be employed—
true-false, multiple choice, or what not—again little objectivity. Seventh, one 
frames the items as carefully as one can—and once more has only his judgment for 
guidance. Eighth, one prepares a key by listing the correct answers—a judgment 
which may not be acceptable to other teachers even of the same subject. Ninth, 
through opinion one defines the conditions of administering the test. Tenth, one 
scores the papers—at last objectivity. But, eleventh, one assigns marks—another 
increment of judgment, and a big one. 


Brownell protests against what he regards as the overemphasis on ob- 
jectivity, which he thinks has unnecessarily lessened the depth and nar- 
rowed the range of measurement. A safe position would appear to be to 
try to make measurement as objective as possible without sacrificing validity. 
It must be remembered that the latter is always more important. It is 
never going to be possible or desirable to eliminate certain basic assump- 
tions underlying all attempts at evaluation. Undoubtedly at times, how- 
ever, we have made assumptions in measurement when we should have had 
evidence. Many test makers, for example, have assumed that one problem 
of a type is sufficient for diagnosis in arithmetic. When the matter was 
actually subjected to experimental analysis in two studies,³⁴ both found
that one problem of a type is likely to be both unreliable and invalid, owing 
largely to chance, and that at least three problems of each type must be 
included for satisfactory individual diagnosis. Another assumption which 
did not check with the evidence was that objectivity of scoring guarantees 
accuracy of scoring. Several studies have demonstrated the fact that scorers 


33 William A. Brownell, “The Use of Objective Measures in Evaluating Instruction,”
Educational Method, 13: 401-408, May-June, 1934.

34 Leo J. Brueckner and Mary Elwell, “Reliability of Diagnosis of Error in
Multiplication of Fractions,” Journal of Educational Research, 26: 175-185, November, 1932;
Foster E. Grossnickle, “Reliability of Diagnosis of Certain Types of Error in Long
Division with a One-Figure Divisor,” Journal of Experimental Education, 4: 7-16,
September, 1935.




of standardized tests must be taught and not merely told how to do it.³⁵
A serious effort should of course be made to eliminate all needless types of 
subjectivity. A guess is usually a poor substitute for actual knowledge. 


D. Usability 


Meaning of usability. There is quite general agreement among authorities
in measurement that the two most important characteristics of a
measuring instrument are validity and reliability. Both have to do with the
theoretical accuracy with which the instrument measures. However, there
are certain other considerations of a practical character which must be
taken into account. In the judgment of the writers all of these may be
conveniently designated by the single term usability. By this is meant the
degree to which the test or other instrument can be successfully employed
by classroom teachers and school administrators without an undue
expenditure of time and energy; in a word, usability means practicability. A
measuring instrument must not only be valid and reliable but also usable. This
viewpoint is well expressed in The Methodology of Educational Research:³⁶


But we must always temporize ideals with practical considerations. Perhaps an 
ideal instrument would be so cumbersome and expensive of effort and time that 
its use would not be warranted. 


Whether or not a test is usable by average teachers in service and other 
persons whose technical training in measurement has been limited depends 
upon several factors, of which the following are probably the most im- 
portant: 


1. Ease of administration. 

2. Ease of scoring. 

3. Ease of interpretation and application. 
4. Low cost. 

5. Proper mechanical make-up. 


Each of these factors will now receive brief consideration.

Ease of administration. Group tests, as a rule, are much easier to 
administer than individual tests. The Stanford-Binet is a good example of 
a test whose validity and reliability are high, but whose usability is low, 
largely because of complicated instructions for giving and scoring. Special 
training in a college course for one semester is usually suggested as the 
minimum required for mastery of these instructions. Even then the test
makes heavy demands upon the examiner’s time.

There are, of course, two types of instructions for a test. One has to do 


35 For helpful suggestions see Arthur E. Traxler, “Administering and Scoring the 
Objective Test,” in E. F. Lindquist (Editor), Educational Measurement, pages 329-416. 
Washington, D. C.: American Council on Education, 1951. 

36 Carter V. Good, A. S. Barr, and Douglas E. Scates, The Methodology of Educational
Research, page 439. New York: D. Appleton-Century Company, 1936. 




with directions to the examiner, and the other has to do with directions 
to the pupil or pupils. But, in general, the requirements are the same for 
both. The motto of a certain news weekly indicates what is required: the 
directions should be “clear, curt, complete." Whether or not examples, 
fore-exercises, and the like are necessary will depend mainly upon the age 
and experience of the group being examined. Whether or not a group test 
is easy to administer depends to a considerable extent upon the complete- 
ness of the manual. Some tests have no time limits, many have generous 
time limits, while still others are broken up into intervals as short as 3, 5, 
8, 10, or 15 seconds. These short intervals are difficult to observe with a 
stop watch and well-nigh impossible without it. On the other hand, tests 
of the so-called self-administering type involve only one short set of direc- 
tions for the entire test. Most tests, however, are broken up into separate 
sections, each of which has its own directions and time limit. In determining
how difficult a test is going to be to administer, a careful examination must 
be made both of the manual and of the test blank itself. 

Ease of scoring. The ease of scoring a test depends primarily upon
three things: objectivity, adequate keys, and full scoring directions. The 
better standard tests rank high on all three counts. Scoring is also facili- 
tated when the pupil has been instructed to record his answers in a straight 
column rather than irregularly over the page, and in the form of a numeral 
or single word rather than a phrase or longer statement. As a rule, all ac- 
ceptable answers should appear on the key. With the exception of scales 
in which score values of unequal weight are required, each correct item 
should count the same as any other correct item in the test. The unequal 
weighting of items, so common in the earlier tests, has been found to add 
to the difficulty of scoring without a corresponding increase in validity or 
reliability. 
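The two scoring rules just stated, that every acceptable answer should appear on the key and that each correct item should count the same, can be sketched as follows. The key, the pupil's answers, and the function name are hypothetical and serve only to illustrate the rules.

```python
def score_paper(answers, key):
    """Count one point for each item whose answer appears on the key.

    answers: the pupil's responses, one per item, in item order.
    key: for each item, the set of all acceptable answers (lower case).
    """
    score = 0
    for given, acceptable in zip(answers, key):
        if given.strip().lower() in acceptable:
            score += 1  # equal weight: every correct item counts one point
    return score

# Hypothetical four-item key; item 3 lists two acceptable spellings.
key = [{"4"}, {"paris"}, {"gray", "grey"}, {"true"}]
paper = ["4", "Paris ", "grey", "false"]
print(score_paper(paper, key))
```

Because each response is a numeral or a single word, the comparison with the key is a simple matter of matching, which is the point of the recommendation above.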

Three ways to speed up the scoring of objective tests are (1) hand-scorable
separate answer sheets, (2) machine-scorable separate answer
sheets, and (3) self-scoring answer sheets. Traxler says:³⁷


The growing tendency to employ separate answer sheets is perhaps the most 
pronounced single trend in objective testing. It unquestionably has a restrictive 
effect upon test construction, for it tends to force test items into a single-response 
pattern—the multiple choice question. The use of other types of questions is not 
precluded, however, if test makers will use skill and ingenuity in setting up their 
answer sheets. More imagination and care than are typically employed in devising 
a test need to be used when setting up separate answer forms. 

. . . there is not, at present, enough experimental evidence to warrant a definite 
statement concerning the effect of separate answer sheets upon the validity of the 
test results. It is appropriate to urge, however, that test authors have an inescapable 
obligation to find by means of research an answer to this question before they “go 
overboard” completely for answer sheet procedures. 

The main justification for the use of separate answer sheets is that they are in- 


37 Arthur E. Traxler, op. cit., page 413.




expensive, that they save time, and that they bring objective tests within the reach 
of hundreds of schools that could otherwise not use these tests at all. Through the 
continued availability of these inexpensive instruments, objective testing may 
eventually be brought to all schools in the United States. 


Concerning machine scoring Traxler is less optimistic, pointing out that 
even for very large testing programs “Machine scoring may not . . . be 
faster or cheaper than manual scoring of a prescribed uniform program 
where all manual scoring procedures can be highly routinized.”³⁸ Furthermore,
unless carefully supervised, machine scoring may be less accurate
than independently checked hand scoring. The chief advantage of machine 
scoring seems to be in those situations where the machine is kept going all 
day long and is operated by a full-time, highly skilled person. 

Many self-scoring devices are available.³⁹ Typical of these are “hat-pin”
punching methods utilized in the Kuder Preference Record⁴⁰ and the Ohio
State University Psychological Test;⁴¹ the concealed carbon of the
Clapp-Young self-marking tests⁴² and Scoreze;⁴³ and the “punchboard” procedure
of the Science Research Associates self-scorer,⁴⁴ where the student punches
until he finally obtains the correct answer. All of these probably have con- 
siderable merit as instructional and motivational devices, but if scores are 
desired for evaluational purposes they can at the present time usually be 
secured more easily and dependably by other methods, except perhaps for 
the Kuder Preference Record. 

Ease of interpretation and application. Whether or not the results 
of a test are easy to interpret and apply depends primarily upon the ade- 
quacy of the manual accompanying the test. In the first place, the manual 
should contain complete norms to facilitate interpretation. Whenever pos- 
sible, all derived scores should be capable of being read directly from tables 
of norms without the necessity of computation. The norms should, as a rule, 
be based both on age and on grade; and, in the case of high-school achieve- 
ment tests, on the length of time the subject has been studied. It is also 
desirable that achievement tests be provided with separate norms for urban 
and rural pupils, and for pupils of various degrees of mentality. Up to the 
present time very few tests are adequately provided with norms for inter- 
pretation. Where the primary emphasis is upon diagnosis and other in- 
structional values of tests, this loss is not very great. In any event, it will 
usually be necessary to rely heavily upon local norms.⁴⁵
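Where local norms must be prepared, a table from which derived scores can be read directly is easily built. The following is a minimal sketch, assuming the percentile rank is defined as the per cent of the local group scoring below a given score plus half of those exactly at it; the scores and the function names are invented for illustration.

```python
def percentile_rank(score, norm_scores):
    """Per cent of the norm group below the score, plus half of those at it."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

def build_norm_table(norm_scores):
    """Map each raw score occurring in the group to its rounded percentile rank."""
    return {s: round(percentile_rank(s, norm_scores))
            for s in sorted(set(norm_scores))}

# Hypothetical raw scores of ten local pupils (invented for illustration).
local_scores = [12, 15, 15, 18, 20, 20, 20, 23, 25, 28]
norm_table = build_norm_table(local_scores)
print(norm_table[20])
```

Once such a table is posted, a raw score of 20 is read off as a percentile rank of 55 without further computation. A real table would rest on far more than ten pupils.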


38 Arthur E. Traxler, ibid., page 408.

39 Sidney L. Pressey, “Development and Appraisal of Devices Providing Immediate
Automatic Scoring of Objective Tests and Concomitant Self-Instruction,” Journal of
Psychology, 29: 417-447, April, 1950.

40 Published by Science Research Associates.

41 Published by Ohio College Association, Ohio State University, Columbus 10, Ohio.

42 Published by Houghton Mifflin Company.

43 Published by California Test Bureau.

44 Published by Science Research Associates.

45 A fuller discussion of norms will appear in Chapter 10.




Several of the better manuals give specific suggestions regarding the use 
to be made of the test results.⁴⁶ Supplying some suggestions as to results is
a valuable service for which it is hoped test publishers in the future will 
accept more responsibility. For many uses it is necessary to have at least 
two forms of the test equated both as to content and as to difficulty 
throughout the full range of scores, and not for averages only. Few tests 
meet this requirement fully. 

Cost. With the exception of certain laboratory apparatus and equip- 
ment for measuring special abilities and disabilities, testing materials are 
usually not very expensive. Few achievement tests covering a single sub- 
ject, or group tests of general intelligence, cost more than ten cents each. 
Batteries covering several subjects when printed as a single booklet usually 
cost from ten to fifteen cents. For a comprehensive testing program the 
general battery will cost less than separate tests covering the same subjects. 
Cost is a practical consideration in most school systems, and there is no 
point in paying more for tests than necessary. 

While it may be true in general, as commonly held, that in the long run 
one gets about what he pays for, there are too many exceptions to make it 
a safe rule. In statistical terminology the correlation between the cost of a 
test and its worth is positive, but too low for accurate prediction. Here, 
as elsewhere, the customer should be wary, lest he not get his money’s 
worth. In a test, as in an automobile, the quality is often not evident on the 
surface. The prospective purchaser should not make cost a primary con- 
sideration, for good tests are often no more expensive than poor ones. 
Fortunately, therefore, relative cost can be considered a minor matter, as 
a rule, and the choice of the test can rest, as it should, upon its validity and 
reliability for the purpose it is to serve. It must be remembered that one 
test may be cheap enough at fifteen cents and another too costly at five. 
After all, the careful purchaser is more concerned with what he gets for his 
money than with what he has to pay. 

Mechanical make-up of the test. Tests issued by the larger publishers 
are almost always printed in clear type of a size appropriate to the grade 
level for which they are intended. But there are some exceptions. One of 
the leading publishers issued a test in which the key word in each sentence 
was supposed to be in bold-faced type, but in many cases the poor quality 
type did not clearly indicate which word was intended. On a timed test 
such as this one, a handicap was imposed upon all pupils except those with 
the keenest vision. In the lower grades careful attention should be given 
to the quality of pictures and illustrations used. In the earlier days it was


46 A good example is: Gertrude H. Hildreth and Harold H. Bixler, Manual for Inter-
preting the Metropolitan Achievement Tests. Yonkers-on-Hudson, New York: World Book 
Company, 1948. 122 pages. This manual is not given free with test orders, however, 
though its price is low. 




common to have the instructions to the examiner appear on the test booklet 
in the hands of the pupil. This practice not only meant a needless cost in 
paper and printing, but was possibly a source of confusion to the pupil. 

Commercial publishers of tests have not as yet given sufficient attention 
to devising tests which will reduce to the minimum their cost in time and 
money. There appears to be no valid educational reason why most tests 
should not be designed with separate answer sheets, a practice which not 
only eliminates the economic waste of using the test one time only and then 
discarding it, but which also greatly facilitates the scoring and makes the 
pupil's test profile available in convenient form for use and filing. Little is 
gained, however, when the answer sheet itself is sold at prices almost as
high as the test itself, as is now often the case. The customer usually gets 
what he wants when he wants it badly enough and makes his wishes known. 
To make convenient, inexpensive tests feasible, moreover, the demand 
must be sufficiently increased so that the additional volume sold compen- 
sates for reduced profit per unit. 

Summary. What, then, are the earmarks of a good measuring instrument?
In brief, a good test or other measuring instrument possesses three
outstanding qualities: validity, reliability, and usability. In other words,
a good test measures what it claims to, consistently, and with a minimum
expenditure of time, energy, and money. But always the first consideration is
validity. The test must not only measure what it purports to, but in the 
case of achievement tests, it should purport to measure the really important 
outcomes of the educational process. No achievement test that fails to do 
this can be considered a satisfactory measuring instrument, whether made 
by the classroom teacher or purchased from a publisher of standard tests. 


E. Some Generalizations Regarding the Problem of 
Measurement 


The role of measurement in science in general and in education in par- 
ticular was set forth in the first chapter; the historical development of 
measurement in education was traced in the second chapter; quantitative 
concepts were discussed in the third chapter; and the characteristics of a 
satisfactory measuring instrument were described in the present chapter. 
In the light of these discussions a few important generalizations will now 
be attempted. 


1. Some kind of measurement or evaluation is inevitable in education. This
generalization is amply supported by the history of every recognized sci- 
ence, and of education itself, regardless of whether it is to be classified as a 
full-fledged science or not. 

2. All measurement is subject to error. This is true of the so-called “exact
sciences”; and to a greater degree it is true of the less exact or newer social 




sciences, such as psychology and education. Westaway, for example, think- 
ing mainly of physics and chemistry, concludes: “We may, in fact, look 
upon the existence of error in all measurements as the normal state of 
things.”⁴⁷ Kelley speaks of the “ubiquitous probable error”⁴⁸ in psychology
and education. These errors can be reduced but never wholly eliminated. 

3. These errors of measurement are due in part to the imperfection in the 
measuring instruments available. There are no perfect measuring instru- 
ments, even in the physical sciences. Westaway, for example, says that 
“even the very best of the instruments with which we perform our
measurements are imperfect.”⁴⁹ This is true of the fundamental units of
measurement in the physical sciences, as well as of the biological and social sciences.
No astronomer knows precisely the velocity of light, and yet the light year 
is the yardstick of celestial measurement; no chemist knows the precise 
value of a single atomic weight, and yet it is the basic unit in chemical 
analysis. In psychology and education these imperfections are an even more 
potent source of error than in the older sciences. However, it must be 
remembered that the tools of measurement are much better than they used 
to be. 

4. The limitations of the methods used are a still more important source of
error in measurement. Again this difficulty is true of the physical sciences 
as well as of the social sciences. For example, Max Planck says that in 
physics “every measurement, however exact, inevitably involves certain 
errors of observation.”⁵⁰ These errors are due partly to sensory and
temperamental defects, and partly to lack of skill in the observer. But a still
more troublesome source of error is the tendency for the act of observation 
to interfere with the phenomena being observed in measurement. Heisen- 
berg, for example, noted that the “measurement of an electron’s velocity 
is inaccurate in proportion as the measurement of its position in space is 
accurate, and vice versa,”⁵¹ owing to the disturbing influence of the light
rays falling on it in the act of measurement. From this discovery resulted
the famous “uncertainty principle” or the “principle of indeterminacy,”
which has profoundly influenced modern physics. “As a matter of fact
every measurement,” says Planck, “whatever the method of its employment,
invariably interferes more or less with the event to be measured.”⁵²
But this interference is so slight as to be only of theoretical interest to the 
laboratory physicist engaged in the study of aggregates of elements instead 
of individual electrons. And the ordinary Newtonian principles of chemistry
and physics still operate in the usual way in such practical realms as
engineering and medicine.

47 F. W. Westaway, Scientific Method: Its Philosophical Basis and Its Modes of
Application, pages 289-290. New York: Hillman-Curl, Inc., 1937.

48 Truman Lee Kelley, op. cit., page 19.

49 F. W. Westaway, op. cit., page 286.

50 Max Planck, The Philosophy of Physics, page 24. New York: W. W. Norton &
Company, Inc., 1936.

51 Ibid., page 62.

52 Ibid., page 68.

But the disturbing effect of the measurement process is more serious in
education. The personality of the examiner, as well as the testing materials,
is always part of the test situation. This is recognized in giving individual
tests, where a proper rapport between examiner and subject is regarded as
essential to a successful examination.⁵⁴ But here, even with skilled examiners,
the factor is rarely eliminated altogether, for it has been found that the IQ 
remains more stable when the same examiner gives all the tests. In all ex- 
periments, whether involving the use of individual or group tests, the sub- 
jects are not purely naive and receptive creatures but are actuated by 
motives of pride, desire to please or make a good impression on the ex- 
aminer, and the like. In other words, the examiner or experimenter is an 
important part of the situation, and it is doubtful whether standardized 
instructions can ever reduce this part to the point at which it is negligible. 
The factor is especially important in personality measurement and in the 
evaluation of social behavior. Certainly one would hardly expect to get as
normal reactions of love-making in the psychological laboratory during the 
day as he would if he were concealed in a tree beside a bench in the park 
during the evening. 

5. Teachers and school administrators must not only understand and ap- 
preciate the functions of measurement in education, but they must realize more 
fully the limitations of present measuring instruments. In the present state 
of measurement two erroneous attitudes are sometimes found. The first is 
that held by certain over-enthusiastic supporters of measurement, who make 
unreasonable claims for existing measuring instruments, and who gloss over 
or refuse to recognize the imperfections that exist. This attitude is not 
unlike that of the adolescent in his first love affair, where, indeed, if love 
is not actually blind, it deliberately closes its eyes; and in any event the 
result is the same. This point of view is unfortunate and unintelligent, for 
it stands in the way of progress toward needed improvements; making such 
unwarranted claims is the surest way to discredit the movement with 
thoughtful people. Fortunately, this attitude appears to be on the
decline.⁵⁷

53 For a stimulating discussion of the practical implications of this uncertainty
principle by a distinguished American chemist, see: Irving Langmuir, “Science, Common
Sense, and Decency,” Science News Letter, 43: 3-4, 12-15, January 2, 1943.

54 Elinor L. Sacks, “Intelligence Scores as a Function of Experimentally Established
Social Relationships Between Child and Examiner,” Journal of Abnormal and Social
Psychology, 47: 354-358, April Supplement, 1952.

55 For a suggestive analytical discussion of the problem, see: Douglas E. Scates,
“Differences Between Measurement Criteria of Pure Scientists and of Classroom
Teachers,” Journal of Educational Research, 37: 1-13, September, 1943.

56 Ross recalls having heard a gray-haired southern educator say, regarding
intelligence tests, soon after World War I: “The worst enemies of any new cause are its darn
fool friends!”

57 Picturesque pleas for sanity in using tests by two pioneers in test-construction are




But a second and equally erroneous attitude goes to the opposite ex- 
treme. It characterizes those who are as blind to the virtues of existing 
measuring instruments as the first group are to their limitations, and who 
refuse to have anything at all to do with tests and examinations until all 
defects are forever removed. This attitude is as unreasonable as that of the 
farmer who has postponed buying an automobile “till them blamed things
is perfected," and who has in the meantime worn out a great deal of shoe 
leather without seeing much of the world either. 

Then there is the third attitude, that of the practical person who has 
learned through experience not to expect perfection. Moreover, he has 
found that excellent work can often be turned out with imperfect tools, 
if only they are used with sufficient skill. He has also discovered that 
greater skill is called for than if the instruments were perfect, and he sets 
out deliberately to attain the skill needed. He realizes that the very exist- 
ence of these imperfections imposes a special obligation upon the user to 
seek to understand as fully as possible their nature in order to get desired 
results in spite of them. Furthermore, he makes a conscious effort in inter- 
preting and using test results in order to take into account the existence of 
errors. In other words, he takes the very common-sense point of view that 
the proper thing to be done under the circumstances is to make the best 
possible use of such tools as exist, while waiting for better ones to be 
developed. 


SELECTED REFERENCES FOR FURTHER READING


Bennett, George K., Seashore, Harold G., and Wesman, Alexander G., Differential 
Aptitude Tests Manual (Second Edition). New York: The Psychological Corpora- 
tion, 1952. Chapter 4, “Validity,” and Chapter 5, “Reliability.” 

Brownell, William A. (Chairman), “The Measurement of Understanding,” Forty-Fifth
Yearbook of the National Society for the Study of Education, Part I. Chicago:
University of Chicago Press, 1946. 338 pages. 

Davis, Frederick B., “Item Analysis in Relation to Educational and Psychological 
Testing," Psychological Bulletin, 49: 97-121, March, 1952. 

Dressel, Paul L., “Problems of Evaluation in General Education,” Proceedings of
the 1951 Invitational Conference on Testing Problems, pages 45-57. Princeton, 
New Jersey: Educational Testing Service, 1952. 

DuBois, Philip H., “Achievement Tests in Personnel Selection,” American Journal
of Public Health, 41: 567-572, May, 1951. 

Fan, Chung-Teh, Item Analysis Table. Princeton, New Jersey: Educational Testing 
Service, 1953. 32 pages. 

Flanagan, John C. (Chairman), “Constructing Examinations So That They Will
Be Valid Measures of Important Functions,” Proceedings of the 1948 Invitational 
Conference on Testing Problems, pages 18-42. Princeton, New Jersey: Educational 
Testing Service, 1949. Papers by Oscar K. Buros, Max D. Engelhart, Warren G. 
Findley, Harold Gulliksen, Charles R. Langmuir, and Phillip J. Rulon. 


made in S. A. Courtis’ "Let's Stop This Worship of Tests and Scales," Nation's Schools, 
31: 16-17, March, 1943; and in Guy M. Wilson’s “Some Subversive Activities of the 
Test Expert," Educational Method, 21: 342-343, April, 1942. 




Freeman, Frank S., Theory and Practice of Psychological Testing. New York:
Henry Holt and Company, 1950. Chapter 1, “Basic Principles: Theoretical and
Clinical.”

Guilford, Joy P., Fundamental Statistics in Psychology and Education (Second
Edition). New York: McGraw-Hill Book Company, 1950. Chapter 17, “Relia- 
bility of Measurements,” and Chapter 18, “Validity of Measurements.” 

Lindquist, E. F. (Editor), Educational Measurement. Washington, D. C.: American 
Council on Education, 1951. Chapter 5, “Preliminary Considerations in Objective 
Test Construction," by E. F. Lindquist; Chapter 9, “Item Selection Techniques,” 
by Frederick B. Davis; Chapter 13, “The Essay Type of Examination,” by 
John M. Stalnaker; Chapter 14, “The Fundamental Nature of Measurement,” 
by Irving Lorge; Chapter 15, "Reliability," by Robert L. Thorndike; and
Chapter 16, "Validity," by Edward E. Cureton. 

Lindquist, E. F., “Some Criteria of an Effective High School Testing Program,” 
pages 17-33 in Arthur E. Traxler (Editor), “Measurement and Evaluation in the 
Improvement of Education,” American Council on Education Studies, Series I, 
No. 46, Vol. XV, April, 1951. Washington, D. C.: American Council on Edu- 
cation. 

Odell, C. W., How to Improve Classroom Testing. Dubuque, Iowa: William C. Brown 
Company, 1953. Chapter I, “Introduction,” and Chapter II, “Objectives.”

Thorndike, Robert L., Personnel Selection: Test and Measurement Techniques. New 
York: John Wiley & Sons, 1949. Chapters 4, 5, and 8: “The Estimation of Test 
Reliability," “The Estimation of Test Validity: Criteria of Proficiency,” and 
“Item Analysis and Selection of Items.”

Thorndike, Robert L. (Chairman), “Criteria for the Evaluation of Achievement 
Tests," Proceedings of the 1950 Invitational Conference on Testing Problems, 
pages 73-112. Princeton, New Jersey: Educational Testing Service, 1951. Papers 
by John B. Carroll, Frederick B. Davis, Harold Gulliksen, and Joseph J. Schwab. 

Thorndike, Robert L., “Tests as Research Instruments,” Review of Educational 
Research, 21: 450-462, December, 1951. 

Travers, Robert M. W., How to Make Achievement Tests. New York: The Odyssey 
Press, 1950. Chapter 1, "Introduction." 

Traxler, Arthur E.; Jacobs, Robert; Selover, Margaret; and Townsend, Agatha, 
Introduction to Testing and the Use of Test Results in Public Schools. New York: 
Harper & Brothers, 1953. Chapter 1, “A Point of Departure," and Chapter 2, 
“What Do Tests Contribute to Understanding the Individual Pupil?" 

Vernon, Philip E., The Structure of Human Abilities. New York: John Wiley & Sons, 
1950. Chapter IV, “Analyses of Educational Attainments.”

Weitzman, Ellis, and McNamara, Walter J., Constructing Classroom Examinations 
—a Guide for Teachers. Chicago: Science Research Associates, 1949. Chapter 1, 
“Basic Aspects of Achievement Tests.”


PART II


The Construction 
of 
Teacher-Made Tests


5 


General Principles of Test Construction 


Importance of the problem. There are at least three reasons why the
development of proficiency in constructing informal teacher-made tests is
important. In the first place, the vast majority of tests in use by classroom
teachers are of this type.1 In the second place, both essay examinations
made and marked by untrained teachers and objective tests used by ordinary
classroom teachers may produce highly unsatisfactory results. The
extensive literature on essay examinations, briefly summarized in Chapter 2,
has repeatedly demonstrated this fact. Amateurs may at times do
even worse with objective tests than with essay examinations. Incredible
as it may be, it does seem possible to make objective tests of lower
reliability than essay examinations.2 In the third place, both logical
considerations and statistical analyses indicate that skillfully prepared informal
tests are as reliable and as valid as some standardized tests. In fact, where
the teaching conditions are unusual, or where the subject matter is not
thoroughly stabilized, as in civics and modern history, such tests may be
even more valid. A state-wide survey of high-school achievement conducted
in Tennessee,3 for example, showed that only 56 per cent of the questions
in the standardized social studies test then in use could be answered from
the state-adopted textbook.

This chapter will consider the general principles of constructing informal 


1 Iven H. Hensley and Robert A. Davis, “What High-School Teachers Think and Do
about Their Examinations,” Educational Administration and Supervision, 38: 219-228,
April, 1952.

2 John M. Stalnaker, “The Essay Type of Examination,” Chapter 13 in E. F. Lindquist
(Editor), Educational Measurement. Washington, D. C.: American Council on
Education, 1951.

3 Jos. E. Avent, Report of the Tennessee State Testing Program, page 83. Nashville:
State Department of Education, 1946.



teacher-made tests grouped under the following four headings, which indi- 
cate roughly the steps or stages in the process: 


1. Planning the test. 
2. Preparing the test. 
3. Trying out the test. 
4. Evaluating the test.


A. Planning the Test 


It should be recognized at the outset that the construction of satisfactory 
measuring instruments is one of the most difficult duties the teacher has 
to perform. Good tests do not just happen. Nor are they the result of a few 
moments of high inspiration or exaltation. On the contrary, the process is
calm, deliberate, and time-consuming. Perhaps the best that can be hoped 
for under existing conditions is that the teacher prepare reasonably com- 
prehensive and adequate informal tests in one subject each year. Best 
results will usually be obtained from cooperative effort. The procedure
employed in developing the Cooperative Achievement Tests, outlined on 
page 112, is a good illustration. Another example is the cooperative plan 
used by 19 colleges in constructing examinations.4 A third illustration is the
procedure followed by the Evaluation Staff of the Eight-Year Study sponsored
by the Progressive Education Association.5 The seven major steps in the
process as set forth in detail by Smith and Tyler6 may be summarized
briefly as follows: 


1. The faculty of each school was asked to formulate a careful statement of its 
educational objectives.

2. Statements from these thirty schools were classified by the Evaluation Staff 
into ten major types of objectives. 

3. Each type of objective was then defined in terms of expected pupil behavior. 

4. Situations were suggested in which pupils could be expected to show the
particular kind of behavior. 

5. The more promising methods of obtaining evidence regarding each type of 
objective were then selected from existing techniques or devised by the staff, and 
subjected to experimental trial. 

6. The methods which made the best showing in this preliminary trial were 
further developed and improved. 

7. Means were devised for the interpretation and effective use of the various 
instruments of evaluation. 


It is recognized that the process just described is too elaborate for the 
ordinary school, or for the individual teacher working on his own. However, 


4 Paul L. Dressel, “Problems of Evaluation in General Education,” Proceedings of the
1951 Invitational Conference on Testing Problems, pages 45-57. Princeton, New Jersey:
Educational Testing Service, 1952.

5 Published in five volumes by Harper & Brothers, New York, 1942, under the general
title Adventure in American Education.

6 Eugene R. Smith, Ralph W. Tyler and Evaluation Staff, Appraising and Recording
Student Progress, Chapter I. New York: Harper & Brothers, 1942.




it cannot be emphasized too strongly that the actual process of test con- 
struction must be preceded by careful planning if the test is to succeed. 
The test will be no better than the quality of the thinking that goes into it. 
In planning the test, consideration must be given to the nature of the ob- 
jective to be measured, the purpose it is to serve, and the conditions under 
which it will be used. 

1. Adequate provision should be made for evaluating all the important outcomes
of instruction. A careful statement of the philosophy of the school
and the objectives of the particular course should be available from the
start. A survey7 of a representative sample of 1660 state, county, and city
courses of study revealed that only 13 per cent contained no statement of 
objectives. To be of maximum helpfulness in either teaching or testing, 
the objectives should be stated as specifically as possible. The expected 
pupil behavior must be indicated. It is not enough to say that the objective 
is "good citizenship” or “an integrated personality." These large indefinite 
terms must be broken down and stated in usable form. 

With the list of teaching objectives for the course clearly and specifically 
stated, the teacher is ready to consider what procedures will be most ap- 
propriate for evaluating progress made toward the attainment of each 
objective. In other words, the teacher attempts to test what he has tried 
to teach by using techniques best adapted to each objective. 

One writer8 suggests that the objectives of instruction may be grouped
into eight major categories: 


1. Functional information
2. Various aspects of thinking
3. Attitudes
4. Interests, aims, purposes, appreciations
5. Study skills and work habits
6. Social adjustment and social sensitivity
7. Creativeness
8. A functioning social philosophy.


Another classification9 contains ten major types:


1. The development of effective methods of thinking

2. The cultivation of useful work habits and study skills

3. The inculcation of social attitudes

4. The acquisition of a wide range of significant interests

5. The development of increased appreciation of music, art, literature, and
other aesthetic experiences

6. The development of social sensitivity

7. The development of better personal-social adjustment

8. The acquisition of important information


7 B. E. Leary, A Survey of Courses of Study and Other Curriculum Materials Published
Since 1934. Washington: United States Office of Education, 1938.

8 Louis E. Raths, “Evaluating the Program of a School,” Educational Research
Bulletin, 17: 51-84, March 16, 1938.

9 Smith, Tyler, and staff, op. cit., page 18.




9. The development of physical health 
10. The development of a consistent philosophy of life. 


For any given course these objectives must be expressed in terms of the 
specific changes in pupils which the teacher is seeking to bring about. 
A rather detailed inventory of the particular facts, principles, concepts 
and skills of the course is required, as well as the specific mental processes 
the pupil is expected to employ. To measure whether these processes are 
really functioning, the teacher's inventory just mentioned must be pre- 
sented to the pupils in language different from that of the text and class 
discussion, and opportunities must be offered to apply or to relate the
objectives to new problems and situations. The center of gravity is the 
behavior of pupils rather than subject matter. The teacher must not con- 
fuse ends and means. The true relationship has been stated as follows: 
“The real ends of instruction are the lasting concepts, attitudes, skills, 
abilities and habits of thought, and the improved judgment or sense of 
values acquired; the detailed materials of instruction—the specific factual 
content—are to a large extent only a means toward these ends.” 10 

A group of English teachers, for example, were able to recognize seven 
different aspects of “appreciation of literature.” They then suggested the 
following ways in which these aspects of appreciation may manifest them- 
selves in pupil behavior: 


1. Satisfaction in the Thing Appreciated: Appreciation manifests itself in a feeling, 
on the part of the individual, in keen satisfaction in, and enthusiasm for, the thing 
appreciated. The person who really appreciates a given piece of literature finds in 
it an immediate, persistent, and easily renewable enjoyment of extraordinary 
intensity. 

2. Desire for More of the Thing Appreciated: Appreciation manifests itself in an 
active desire on the part of the individual for more of the thing appreciated. The 
person who really appreciates a given piece of literature is desirous of prolonging, 
extending, supplementing, renewing his first favorable response toward it. 

3. Desire to Know More about the Thing Appreciated: Appreciation manifests 
itself in an active desire on the part of the individual to know more about the thing 
appreciated. The person who really appreciates a given piece of literature is desirous 
of understanding as fully as possible the significant meanings which it aims to 
express and of knowing something about the genesis, its history, its locale, its 
sociological background, its author, etc. 


10 E. F. Lindquist, “The Use of Tests in the Accreditation of Military Experience and
in the Educational Placement of War Veterans,” Educational Record, 25: 366, October,
1944.

11 Louis Raths, “Appraising Certain Aspects of Student Achievement,” Thirty-Seventh
Yearbook of the National Society for the Study of Education, Part I, pages 114-115.
Quoted by permission of the Society. Bloomington, Illinois: Public School Publishing
Company, 1938. This reference contains some very suggestive tests designed to measure
appreciation of literature, attitudes held toward important social issues, and important
aspects of thinking. Also see William A. Brownell (Chairman), “The Measurement of
Understanding,” Forty-Fifth Yearbook of the National Society for the Study of Education,
Part I. Chicago: University of Chicago Press, 1946.




4. Desire to Express One’s Self Creatively: Appreciation manifests itself in an 
active desire on the part of the individual to go beyond the thing appreciated, to 
give creative expression to ideas and feelings of his own which the thing appreciated 
has chiefly engendered. The person who really appreciates a given piece of literature 
is desirous of doing for himself, either in the same or in a different medium, some- 
thing of what the author has done in the medium of literature. 

5. Identification of One’s Self with the Thing Appreciated: Appreciation manifests 
itself in the individual’s active identification of himself with the thing appreciated. 
The person who really appreciates a given piece of literature responds to it very 
much as if he were actually participating in the life situations which it represents. 

6. Desire to Clarify One's Own Thinking with Regard to the Life Problems Raised 
by the Thing Appreciated: Appreciation manifests itself in a desire on the part of 
the individual to clarify his own thinking with regard to specific life problems 
raised by the thing appreciated. The person who really appreciates a given piece 
of literature is stimulated by it to rethink his own point of view toward certain of 
the life problems with which it deals and perhaps subsequently to modify his own 
practical behavior in meeting these problems. 

7. Desire to Evaluate the Thing Appreciated: Appreciation manifests itself in a 
conscious effort on the part of the individual to evaluate the thing appreciated in 
terms of such standards of merit as he himself, at the moment, tends to subscribe 
to. The person who really appreciates a given piece of literature is desirous of dis- 
covering and describing for himself the particular values which it seems to hold 
for him.


Arnold12 has shown that critical thinking can be taught in the elementary
school and that various phases of the process can be measured. His study 
assumed that critical thinking involves the intelligent use of data, which 
was defined as the "ability to recognize relevance, dependability, bias in 
source, and adequacy of data in regard to a particular problem, question, 
or conclusion.” The following item has to do with the recognition of bias 
in data: 

Three boys were talking about whether or not a boy Jim was really “out” in a 
game of baseball that had been played that afternoon. John was on Jim's side. 
George was on the other side. Bill was not playing but was watching the game. 
Which of the three boys is most likely to be right? ______
Why? ______


Recognition of the adequacy of data was measured by such test situa- 
tions as the following: 


Some people were talking about a ball team. Someone asked the others to tell 
why they thought this team was good. Here are the answers. Read them carefully. 
Put B before the one of these three that you think is the best reason for thinking
this is a good ball team. Put P before the one you think is poorest, and put F 
before the one you think is just fair. 

____ 1. I think this is a good ball team.
____ 2. I have seen them play once, and I think they are good players.
____ 3. I have seen them play several times, and they are good players.


12 Dwight L. Arnold, “Testing Ability to Use Data in the Fifth and Sixth Grades,”
Educational Research Bulletin, 17: 255-259, 278, December 7, 1938. 




Read these next two. Find the better reason of the two; place a B before it. 
Find the poorer reason; place a P before it. 
____ 1. A man who studies baseball and writes much about it said they were a
good team.
____ 2. A man I talked to on the street the other day said they were good players.


Some boys were talking about a boy they called D. Jim said, “I saw D take 
a pencil from another pupil's desk. This makes me think he is a thief.” If this is all 
that Jim knew about this, do you think Jim is right in thinking that D was a thief? 


Put a line under your answer: YES NO AM NOT SURE 
Now tell why you answered as you did. 


It must be recognized, moreover, that some of the objectives of instruc- 
tion cannot be measured by paper-and-pencil tests of this kind. At times 
rating scales, check lists, and other devices for recording observations of
the individual at work or play are required. The term “test” in this discus- 
sion includes any instrument that affords valid evidence of progress made 
by pupils toward the attainment of the objectives of instruction. 

One of the least tangible objectives of instruction is creativeness.13 Grimes
and Bordin14 have proposed that creative expression in art should result
in the development of certain personality traits. These traits would be 
evaluated by the art teacher through observation during a conversation 
with his pupils. This process is guided by a check list upon which the 
teacher’s record is entered. Grimes and Bordin suggest that this technique 
would be more valuable if a group of teachers co-operated in the construc- 
tion of a check list of their own, rather than adopted wholesale the list 
given below: 

I. Initiative: willingness to go into the unknown, to start off on a new track, to
attempt something never attempted before; perseverance after recognizing a dead 
end; willingness to try again. 

1. Attempts a medium, technique, or subject never attempted before. 

2. Does not accept as final the view of the subject which he happens to have in 
the place where he begins, but moves around and views subject to be painted 
from many angles before deciding where to work. 

3. Does not take for granted the posed object to be painted, but views the total 
situation and sees what, for him and his experience, there is in it that is 
paintable. 


4. Assists in posing the model or arranging still life, and the like. 


13 For stimulating discussions of this general topic, see Joy P. Guilford, “Creativity,” 
American Psychologist, 5: 444-454, September, 1950, and Louis L. Thurstone, “Creative
Talent,” Chapter 2 in Applications of Psychology. New York: Harper & Brothers, 1952. 

14 James W. Grimes and Edward Bordin, “A Proposed Technique for Certain Evaluations
in Art,” Educational Research Bulletin, 18: 1-5, 29, January 4, 1939.


5. Brings material to be painted, as still-life objects, and the like, to the studio
from outside sources.

6. Starts to work rather than depending on the teacher for instructions as to how
to proceed.

7. Does not mind making mistakes but pursues the work despite reverses and
difficulties.

a) Scrapes off in painting in oils the thick paint when canvas is gummed up;
in water color washes out, when these steps are necessary for continued
work on the painting.

b) Does not regard initial drawing for a composition as absolute, but moves
shapes as the developing experience demands adjustments and changes.

8. Participates in class discussions and contributes ideas and experiences.

9. Pursues some meaningful activity as sketching, if he finishes before the others
in the class, rather than stalling around.

10. Does not demand approval or supervision from the teacher or other students
at almost every step in his work.

11. Places his work away from himself, changes viewpoint in order to get a more
objective view of his work.


II. Concentration, Interest, Motivation: vigor with which an individual attacks
a problem and his oneness of purpose which would result in excluding factors 
nonrelevant to a given problem (perseverance is implied here). 


1. Not distracted by others coming and going or talking.

2. Does not come and go himself.

3. Does not idly converse about matters exterior to the situation.

4. Does not let the work of others distract him from his own problems.

5. Does not quickly come to a dead end in his own work.

6. Does not stall around pretending to be working on a project.

7. Does work outside of the class that has bearing on class work, as: go to gallery,
sketch, draw, and consult reproductive material.

8. Works on painting after hours.

9. Requests information as to work relative to his development.

10. Contributes to class discussion.

11. Talks to friends about his work and attempts to explain what he has been
accomplishing and learning.

12. Paints the same subject more than once; does not give out quickly as regards
subject matter.

13. Does not continually consult the time.


III. Judgment: weighing the factors in a situation and taking them into account
before initiating a new action; that is, considering the possible results of
an action before the initiation of it, seeing the social implications of a proposed 
action. Implied in this is knowing when to go off into the unknown and when not 
to; knowing when to pursue an independent course and when not to. 


1. Does not aid others to the point of interfering with the progress of his own
work or that of others.


2. Attempts to understand the point of view of others rather than thinking of 
ways to justify himself. 

3. Selects a position to work which does not obstruct another's view of the 
subject. 

4. Does not talk so loudly that it distracts others who are working. 

5. Takes into consideration the desires and interests of others when arranging a 
subject to be painted. 

6. Analyzes his working situation in relationship to time, when there is a group- 
determined time limit. 

7. Knows when to pursue a project further and when to discard it and start 
another. 

8. Uses materials and cares for them efficiently—cleans palettes, washes brushes, 
and the like. 


9. Works plastically; that is, he allows for working with the forms rather than
setting down an architectural plan as a rigid drawing which is filled in with
color. Shows evidence of an exploratory and feeling-out attitude rather than 
a rigid method of working. 

10. Examines critically criticism by others and makes use of it only so far as he 
feels it significant; that is, he reacts to it in terms of its validity rather than 
emotionally. 


11. Takes into consideration the needs of others when using group materials. 

12. Avoids vacillation in following out his own painting rather than shifting in 
style, execution, and attitude, as he sees others in the class going in a direction 
different from his own. 


IV. Co-operation: the willingness to work in a group as a member of it in relationship
to the teacher, individual members, and the whole group.


1. Makes use of criticism, does not react to it as personal insult, or cry, show 
anger, or leave class. 
2. Is willing to alter his personal objectives to meet the situation. 


It must be kept in mind that the objectives of the course represent 
directions of progress rather than destinations to be arrived at by individual 
pupils at any particular time. As far as possible, the progress of each indi- 
vidual should be measured in terms of his own interests, needs, and abilities. 
This is the aim of the modern school. The degree to which it is actually 
attained in any particular situation is dependent upon the resources avail- 
able as well as upon the educational philosophy and skill of the teaching 
staff. 

2. The test should reflect the approximate proportion of emphasis in the 
course. To insure a reasonable balance in the test, it is essential to draw up 
in outline form a sort of “job analysis" or “table of specifications." 15 This 
will guide the test maker much as the architect’s blueprint and specifications
guide the building contractor. It is well to indicate not only the various
objectives the teacher has had in mind, but also, at least roughly, the
relative amount of emphasis each objective has received in the actual teaching
of the course. For example, the same test might not be equally valid
for two teachers of a course in general science using the same textbook.
This would be the case if one teacher emphasized almost altogether the
memorization of isolated facts, while the other was much more concerned
with the understanding of facts in relation to other facts, and in their
application to practical problems in the community. The test should attempt
to reflect faithfully the teaching emphasis. The amount of time devoted by
the teacher to the various topical divisions of the course is a rough indication
of what he considers to be their relative importance. The content of
the test should show a similar proportion. The time devoted to a topic can
at best indicate only the number of items to be included, and not the type
of the items. The type of items to be used will depend upon the nature
of the objective to be measured. A topical outline is only a partial guide to
test construction. The table of specifications should also indicate the
approximate teaching emphasis from the standpoint of knowledge, skills,
attitudes, and other types of objectives that have been sought.

15 For a systematic approach to specifying the content of individual items see John C.
Flanagan, “The Use of Comprehensive Rationales in Test Development," Educational
and Psychological Measurement, 11: 151-155, Spring, 1951.
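The proportion rule in principle 2 can be illustrated with a small worked sketch. It assumes the teacher has a rough record of the class periods given to each topic; the topic names and figures are hypothetical, and the largest-remainder rounding is simply one convenient choice, not a method prescribed in the text. The sketch settles only how many items each topic receives; the type of item still depends on the objective being measured.

```python
from fractions import Fraction


def allocate_items(time_by_topic, total_items):
    """Split total_items across topics in proportion to teaching time."""
    total_time = sum(time_by_topic.values())
    if total_time <= 0:
        raise ValueError("total teaching time must be positive")
    # Exact proportional share for each topic, before rounding.
    shares = {
        topic: Fraction(time) * total_items / total_time
        for topic, time in time_by_topic.items()
    }
    # Start from the whole-number part of each share.
    counts = {topic: int(share) for topic, share in shares.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the topics with the largest fractional parts.
    by_remainder = sorted(
        shares, key=lambda topic: shares[topic] - counts[topic], reverse=True
    )
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical general science course: class periods spent on each topic.
periods = {"weather": 10, "electricity": 6, "plant life": 4}
plan = allocate_items(periods, total_items=40)
```

With these assumed figures, half of the teaching time went to weather, so half of the forty items are drawn from that topic.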

3. The nature of the test must take into consideration the purpose it is to 
serve. Any test is valid to the degree that it serves a specific purpose. If the 
purpose of the test is to afford a basis for school marks or for classification, 
it will attempt to rank the pupils in order of their total achievement. But 
if the purpose of the test is diagnosis, its value will depend upon its ability 
to reveal specific weaknesses in the achievement of individual pupils. Diag- 
nostic tests would cover a limited scope but in much greater detail than 
a test of general achievement, and would be arranged so as to reveal the 
scores on the separate parts. The range of difficulty of the items and the 
discriminating value of the items individually are relatively less important 
in diagnostic tests. This is also true of mastery tests administered at the end 
of a teaching unit to determine when the minimum essentials have been 
achieved. 
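The remark that a diagnostic test should be arranged so as to reveal the scores on the separate parts can be sketched as follows. The item numbers, part labels, and answer key are hypothetical, and the function is only an illustration of part scoring under those assumptions, not a procedure given in the text.

```python
from collections import defaultdict


def part_scores(responses, key, part_of):
    """Return {part: (number right, number of items)} for one pupil."""
    tally = defaultdict(lambda: [0, 0])
    for item, correct_answer in key.items():
        part = part_of[item]
        tally[part][1] += 1  # one more item belongs to this part
        # An omitted item (absent from responses) counts as not right.
        if responses.get(item) == correct_answer:
            tally[part][0] += 1
    return {part: tuple(counts) for part, counts in tally.items()}


# Hypothetical four-item arithmetic test with two parts.
key = {1: "a", 2: "b", 3: "c", 4: "d"}
part_of = {1: "fractions", 2: "fractions", 3: "decimals", 4: "decimals"}
pupil = {1: "a", 2: "b", 3: "a"}  # item 4 omitted
profile = part_scores(pupil, key, part_of)
```

A total score of 2 out of 4 would hide what the profile shows: this pupil handled the fractions items and missed both decimals items.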

4. The nature of the test must take into consideration the conditions under 
which it is to be administered. In planning the test, attention must be given 
to such factors as the time available for testing, the facilities for duplicating 
the tests, and the cost of the materials, as well as the age and experience 
of the pupils being tested. 


B. Preparing the Test 


The second step is the actual preparation of the test. It has been found 
from experience that the following rules or suggestions are helpful: 

1. The preliminary draft of the test should be prepared as early as possible. 
Many teachers find it desirable to jot down items to be included in the tests 
day by day as the teaching progresses. This is reasonable assurance that 




no important point in the course will be omitted in the test. If this is not 
done, the supplementary material of the course which is not included in the 
textbook and which may be of unusual value is especially likely to be overlooked.
This practice also permits the material to “grow cold” and consequently
to be more correctly appraised before it is included in the final
draft of the test. 

2. The test may include more than one type of item. A variety of test types 
is likely to be more interesting to the pupil than a single form. This is 
especially true of long tests. Moreover, the requirement that the type of 
test situation should be the one which is most appropriate to the material 
to be included will sometimes necessitate that three or four forms of ob- 
jective items be used. These objective items are frequently combined with 
one or more discussion questions to make up the test. 

3. Most of the items in the final test should be of approximately 50 per cent 
difficulty after being “corrected for chance” by the procedure described in 
footnote 32 on page 160. That is, about half of the group should “know” 
the answer to each item, while the other half should not. This requirement 
cannot be met very closely in the typical school situation, however, because 
item difficulties in the preliminary form of the test will vary considerably. 
There will usually be too few items to permit discarding those not close to 
the 50 per cent mark. A suggested rule-of-thumb method for constructing 
the final (post-tryout) form is this: For motivational purposes, let Items 1 
and 2 be so easy that almost nobody will miss them. Put aside (do not use) 
all other items whose correct answer was “known” by less than 16 per cent 
or more than 84 per cent of the students in the tryout group. Then let 
Item 3 be the easiest of the remaining items, one “known” by about 84 per
cent of the persons tested. Arrange the other items in ascending order of 
difficulty, with the hardest ones at the end of the test. 
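
The rule of thumb just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assemble_final_form`, the dictionary input, and the choice of simply taking the two easiest tryout items as Items 1 and 2 are assumptions of the sketch, and the percentages are taken to be already corrected for chance. The tryout figures in the example are invented for demonstration.

```python
def assemble_final_form(percent_known, n_openers=2, low=16.0, high=84.0):
    """Order tryout items for the final form of a test.

    percent_known maps an item label to the percentage of the tryout group
    that "knew" the answer (assumed already corrected for chance).
    """
    # Easiest items first; ties are broken by label so the result is stable.
    by_ease = sorted(percent_known, key=lambda item: (-percent_known[item], item))

    # Items 1 and 2: the easiest items available, kept for motivation.
    openers = by_ease[:n_openers]

    # Set aside every other item outside the 16-84 per cent band.
    kept = [item for item in by_ease[n_openers:]
            if low <= percent_known[item] <= high]

    # `kept` is already in ascending order of difficulty
    # (descending percentage known), hardest items last.
    return openers + kept


# Hypothetical tryout results: item label -> per cent who "knew" the answer.
tryout = {"a": 98, "b": 96, "c": 90, "d": 84, "e": 60, "f": 50, "g": 30, "h": 10}
print(assemble_final_form(tryout))
```

In this made-up tryout, items "c" (90 per cent) and "h" (10 per cent) are set aside, and the remaining items run from easiest to hardest after the two opening items.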

The above discussion implies that for maximum discrimination the diffi- 
culty of the entire test should be such that, when allowance is made for 
chance, the average pupil in the group makes about 50 per cent of the possible
score. It is clear, then, that a test which is of ideal difficulty for one
class may be too easy or too difficult for other classes.

Perhaps one of the worst defects of most teacher-constructed tests is 
failure to make the items difficult enough, probably in large measure be- 
cause of the influence of the “70 per cent is passing" tradition. In order to 
pass all but a few students, the teacher who grades on a percentage-right 
basis must build tests for which the average score is 80 per cent or more, 
thereby causing the items that make up the test to be too easy for efficient 
measurement.

A few exceptions to this principle should be noted. In speed tests for such 
subjects as arithmetic and typewriting, where the objective is rate rather 
than power, all items should be rather easy. Also, in both mastery and diagnostic
tests the content is determined primarily by the importance of the
subject matter rather than its difficulty. An adequate diagnostic test in the 
fundamental combinations in addition, for example, might yield many 
almost perfect scores in a strong class, and scores well below 50 per cent 
in a weak class. 

4. It is usually desirable to include more items in the preliminary draft of 
the test than will be needed in the final form. This will permit “culling out,” 
later on, items that may appear weak or not needed to produce the proper 
balance in the test. For each subdivision of the test, from 25 to 50 per cent 
more items should be prepared than are likely to be required. 

5. After some time has elapsed, the test should be subjected to a critical 
revision. Then the items should be carefully checked with the table of spec- 
ifications to see that the test shows the desired emphasis upon the various 
topics. A careful reading of the test after an interval of time will usually 
reveal some objectionable items. It is a good plan to have the test criticized 
by other teachers of the same subject. In this way some items are likely 
to be found which cover points of doubtful importance, others which are 
not clearly stated, and perhaps others about which there is disagreement
as to the answers. The wording of the items should receive critical attention,
particularly to avoid ambiguity. One serious error is the wording of
items so that more than one reasonable interpretation is possible. The 
trouble with such ambiguous items is that a certain answer is correct with 
one interpretation, but with another interpretation a different answer is 
reasonably correct. 

6. The items should be so phrased that the content, rather than the form of 
the statement, will determine the answer. A common mistake is to include a 
telltale word or phrase that affords an unwarranted clue to the answer. 
These so-called specific determiners are especially common in true-false 
items. It has been found that statements containing emphatic words, such 
as the adverbs “always,” “never,” “entirely,” “absolutely,” “exclusively,” 
and the like, are much more likely to be false than true. On the other hand, 
words or expressions that limit the statement, such as “may,” “sometimes,”
“as a rule,” “in general,” and the like, are much more likely to be true than
false. Either these expressions should be avoided entirely, a suggestion 
which is rarely feasible, or items containing them should be carefully bal- 
anced so that approximately the same number are true as false. Avoiding 
the language of the text will prevent pupils with good rote memories from 
answering items they may not understand. Sometimes clues are afforded 
by the spelling or by the grammatical form of the item. It is not unlikely 
that one of the reasons why many pupils prefer objective tests to other 
types is that such tests often contain items so worded as to be answered 
from a minimum knowledge of the subject matter involved. Such defects, 
however, are not inherent in objective testing; they can be avoided by the
alert test maker. Administering the test to persons unfamiliar with the
content of the course will often reveal those items which can be answered
from general intelligence or from a general knowledge of language forms 
and usage. 
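
A check for specific determiners can be sketched as follows. The word lists are the examples quoted in the text; the function names, the use of whole-word matching, and the sample statements are assumptions of this sketch, and a real check would still need a human reading (the word "may," for instance, also matches the month).

```python
import re

# Word lists taken from the examples in the text; extend as needed.
EMPHATIC = ("always", "never", "entirely", "absolutely", "exclusively")
LIMITING = ("may", "sometimes", "as a rule", "in general")


def find_determiners(statement, phrases):
    """Return the phrases that occur as whole words in the statement."""
    text = statement.lower()
    return [p for p in phrases
            if re.search(r"\b" + re.escape(p) + r"\b", text)]


def determiner_balance(items):
    """Tally true and false keys among items containing each kind of determiner.

    items is a list of (statement, keyed_answer) pairs, keyed_answer a bool.
    """
    tally = {"emphatic": {True: 0, False: 0}, "limiting": {True: 0, False: 0}}
    for statement, key in items:
        if find_determiners(statement, EMPHATIC):
            tally["emphatic"][key] += 1
        if find_determiners(statement, LIMITING):
            tally["limiting"][key] += 1
    return tally


# Hypothetical true-false items with their keyed answers.
items = [
    ("A reliable test is always valid.", False),
    ("A valid test may be short.", True),
    ("Essay tests are never objective.", False),
]
print(determiner_balance(items))
```

A tally in which the emphatic items are nearly all keyed false, or the limiting items nearly all keyed true, signals the imbalance the text warns against.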

The opposite mistake is often made also. Figurative language, needlessly 
heavy vocabulary, or involved sentence structure may so obscure the mean- 
ing of an item that it is marked incorrectly by pupils who really understand 
the point. Bob Burns' story of the time Grandpa Snazzy was a witness in 
court illustrates this error: 


The attorney says: “Now, Mr. Snazzy, did you or did you not, on the date in 
question or at any time previously or subsequently, say or even intimate to the 
defendant or anyone else, whether friend or mere acquaintance or in fact a total 
stranger, that the statement imputed to you, whether just or unjust and denied 
by the plaintiff, was a matter of no moment or otherwise? Answer—did you or 
did you not?”

Grandpa thinks a while and then says, “Did I or did I not what?”


Unless the test aims specifically to measure reading ability or general 
intelligence, the form of the item should neither impose unreasonable ob- 
stacles in the pupil's way nor provide clues which are too obvious. Both 
defeat the purpose for which the test was intended. 

7. The items should be so worded that the whole content functions in deter- 
mining the answer, rather than only a part of it. There is often a wide dis- 
crepancy between what actually determines the pupil’s response to a test 
and what the teacher intended. One of the principal reasons for this dis- 
crepancy is that only a part of the content of the item functions, the rest 
being wholly inert as far as the pupil is concerned. Lindquist¹⁶ gives some
excellent examples of this difficulty. Two of these, shown below, should 
make the problem clear. Note the first: 


The leader in the making of the compromise tariff of 1833 was (1) Clay, 
(2) Webster, (3) Jackson, (4) Taylor, (5) Harrison. 


That the majority of the pupils who responded to this item correctly did
so on the superficial basis of the strong verbal association between the
words “compromise” and “Clay” is evidenced by the fact that fewer than
half of them responded correctly when the item appeared in the following 
form: 

The leader in the tariff revision of 1833 was (1) Clay, (2) Webster, (3) Jackson, 
(4) Taylor, (5) Harrison. 


That the matching type of test is also subject to this error is shown by 
the next illustration: 


16 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, The Construction and Use of
Achievement Examinations, pages 73-81. Boston: Houghton Mifflin Company, 1936.


Directions: Below are two columns of items. Match the items in the two columns 
by placing on the line before each group of words in Column A the right number 
from Column B. 


Column A

....1. a Phoenician contribution to civilization.
....2. most famous building of the ancient Greek world.
....3. the fleet whose defeat in 1588 gave England the control of the
       Atlantic Ocean.
....4. a boundary between two colonies that later became famous as the
       division between free and slave territory.
....5. the victory which caused France to come to our aid during the
       Revolutionary War.
....6. the law that forbade slavery north of the Ohio River.
....7. a ruling by the Supreme Court which opened all territory to slavery.

Column B

1. Mason and Dixon Line
2. Spanish Armada
3. Saratoga
4. Dred Scott Decision
5. Parthenon
6. Missouri Compromise
7. Alphabet
8. Printing Press
9. Ordinance of 1787


In most of the above items a single word gives the clue. For example, 
“boundary” in item 4 suggests “line” in response 1. Likewise, either 
"ruling" or “court” in item 7 suggests "decision" in response 4. If a pupil 
knows that “armada” means “fleet,” he would be able to match item 3 
with response 2 without knowing the date, the country, or the event in- 
volved. It should be noted, furthermore, that probably the above test 
would still be poor, even if each item were well worded, because the items 
included are so diverse in character. 

The test maker should attempt to anticipate the specific mental proc- 
esses the pupil will employ in each response. For each item the teacher 
should raise such questions as the following: Are there any parts of the 
item that the pupil may disregard entirely and yet respond correctly? 
What is the minimum amount of knowledge required for a correct response? 

8. All the items of a particular type should be placed together in the test. 
Sometimes completion, true-false, and multiple-choice items of varying 
numbers of choices are thrown together in random order. This arrangement 
is rarely, if ever, desirable. It is good practice to place together the items 
of similar type. Such an arrangement not only facilitates the scoring of the 
test and the interpretation of the scores, but enables the pupil to take full
advantage of the mind-set imposed by a particular item form. 

9. The items in the test should be arranged in ascending order of difficulty. 
It is especially important to have the easiest items at the beginning and 
the hardest ones at the end of the test. It will be recalled that one of the 
problems of measurement is to arrange conditions so that the thing being
measured is disturbed as little as possible in the act of measuring. The
psychological justification for placing the easiest items first is that such an 
arrangement has a wholesome effect upon the morale of the pupils taking 
the test. On the other hand, placing very difficult items at the beginning 
is likely to produce needless discouragement in the pupils, particularly with 
those of average ability and below. If the most difficult items come toward 
the end of the test, only the more capable pupils will probably get to them. 
After all, the only function of such items is to discriminate among the high- 
ranking pupils. In any event, any disturbing influence on the weaker pupils 
will come too late to affect the results seriously.

In advance of an actual tryout of the test, it is impossible to determine 
anything more than a rough estimate of the true difficulty order of the 
items, unless one is willing to go to the trouble of obtaining the pooled 
judgment of three or more persons.¹⁷ The judgment of a single experienced
teacher regarding the difficulty of the items is likely to have some validity. 
In any case it is usually possible to pick out those items that will be at the 
extremes of the scale; and fortunately this is what is needed most. In later 
revisions of the test, the items can be placed in more exact order of diffi- 
culty. 

10. A regular sequence in the pattern of responses should be avoided. The 
order of correct responses should be a chance order rather than a regular 
pattern. If items are arranged alternately true and false, or two true and 
two false, for example, the pupil is likely to discover the arrangement. 
To facilitate scoring, it is sometimes suggested that multiple-choice items 
be so arranged that the option numbers of the correct responses give com- 
binations easy to remember, such as a date like 1453. But there is always 
risk that the pupil will “get the hang” of the pattern and answer success- 
fully without considering the content of the item at all. 
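
One way to obtain a chance order of correct responses is sketched below. The helper names are hypothetical, the sketch assumes the correct answer does not also appear among the distractors, and the seed is fixed only so that a key can be reproduced later for scoring. The options used in the example are those of the tariff item quoted earlier in this chapter.

```python
import random


def shuffle_options(correct, distractors, rng):
    """Place the correct answer at a chance position among the options.

    Returns the shuffled option list and the 1-based number of the keyed answer.
    Assumes `correct` does not also occur among the distractors.
    """
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return options, options.index(correct) + 1


def has_regular_alternation(key):
    """True if a true-false key simply alternates (T, F, T, F, ...)."""
    return len(key) > 2 and all(a != b for a, b in zip(key, key[1:]))


rng = random.Random(1954)  # fixed seed so the same key can be regenerated
options, keyed = shuffle_options(
    "Clay", ["Webster", "Jackson", "Taylor", "Harrison"], rng)
print(options, "correct option number:", keyed)
print(has_regular_alternation([True, False, True, False, True]))
```

The alternation check covers only the simplest regular pattern named in the text; other patterns, such as two true followed by two false, would need their own checks.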

11. Provision should be made for a convenient written record of the pupils'
responses. Such a record is a check list, a rating scale, or some other similar 
form upon which the observer makes a systematic and permanent record 
of a pupil’s behavior under a given set of conditions. It is particularly 
difficult to provide a satisfactory written record of responses on oral 
quizzes.¹⁹ In the ordinary test the pupil makes his own record in writing
either on the test paper or a specially prepared answer sheet. The problem 
then is merely that of arranging the test so that the labor of scoring will be 
reduced to a minimum. Such devices as numbering or lettering the responses
in multiple-choice items and the blanks in completion items, so
that the responses will be recorded in a column rather than scattered
irregularly over the page, save time and reduce the chances of error in
scoring. Merely grouping the items by fives, rather than spacing them
uniformly, reduces somewhat the eyestrain in scoring the test.

17 Sherman Tinkelman, Difficulty Prediction of Test Items, page 49. New York:
Bureau of Publications, Teachers College, Columbia University, 1947. Further
investigation of this problem was undertaken by Irving Lorge and Lorraine Kruglov, “A
Suggested Technique for the Improvement of Difficulty Prediction of Test Items,”
Educational and Psychological Measurement, 12: 554-561, Winter, 1952.

18 Scarvia B. Anderson, “Sequence in Multiple Choice Item Options,” Journal of
Educational Psychology, 43: 364-368, October, 1952.

19 Max M. Kostick and Belle M. Nixon, “How to Improve Oral Questioning,” Peabody
Journal of Education, 30: 209-217, January, 1953.

12. The directions to the pupil should be as clear, complete, and concise as 
possible. The aim should be to make the instructions so clear that the 
weakest pupil in the group knows what he is expected to do, although he 
may not be able to do it. The pupil should be told how and where to mark 
the items, the time allowed to do so, and any reduction for errors to be made 
in scoring. The amount of detail required will depend upon the maturity 
of the pupils and their experience with that particular type of test. To very 
young children, for example, it will be better to say “draw a line under" 
rather than “underline,” and “draw а ring around the right answer" rather 
than “encircle the correct response.” In the lower grades it is usually de- 
sirable for the teacher to read the directions aloud to the pupils while they 
follow silently the written directions on their test papers. Wherever the 
form of the test is unfamiliar or complicated, a generous use of samples 
correctly marked and fore-exercises or practice tests that do not count in 
determining the score is to be recommended. Sometimes a blackboard 
demonstration is the best way to make the procedure clear. As the pupils 
become familiar with the various types of items and the procedure used in 
scoring them, the directions may be greatly abridged. 

A single illustration should make these points clear. The following direc- 
tions may be considered reasonably satisfactory for a class unfamiliar with 
objective tests: 

Directions to the Pupil: Below are thirty statements about measurement in
education. Examine each statement and decide whether it is true or false. In
the ( ) before each statement you think is true, put +; in the ( ) before each
statement you think is false, put 0. You will have ten minutes for the test. Your
score will be the number right minus the number wrong. You may mark an answer
even when you have only a slight hunch, but do not guess wildly. Study the samples
below. They are answered correctly.

SAMPLES: 

(0) A. High reliability insures high validity in a test. 

(+) B. Group tests of intelligence originated in America. 

After the pupils have become familiar with true-false tests and the 
method employed in scoring them, the directions may be shortened to a 
form somewhat as follows: 

Directions: In the ( ) before each item put + if true, and 0 if false. You will 
have ten minutes for the test. 


One other point warrants consideration. Should pupils be told or en- 
couraged to guess at items about whose answers they are in doubt? Some 
authorities would require the pupils to attempt all items on recognition
tests. They would include some such statement as, “If you do not know,
guess!” Others would go to the other extreme and say, “Do not guess!”
Still others, perhaps the majority, would be content with informing the 
pupil that the correction formula²⁰ is to be employed and let him use his
judgment about attempting doubtful items. Unfortunately, the experimental
evidence on this point is neither extensive nor altogether convincing.
Most of the studies have merely attempted to compare the relative effect
of the first two practices upon the validity and reliability of the scores,
without considering the third possibility at all.

The results have usually favored do-not-guess-wildly instructions by a
slight margin. Davis is particularly opposed to forced guessing:²¹


To force students to mark answers to items based on reading passages available 
for reference is undesirable, but to force them to mark answers to items testing 
specific subject matter that they do not know and cannot figure out is far worse. 
It is not only frustrating to the students but it goes contrary to good teaching
practices and compels the students to break habits of carefulness that the schools 
try hard to inculcate. Once in a while it is possible that students might be told 
that as part of an experiment they are required to mark every item in a test even 
when they have no idea what to mark, but in systematic testing programs this 
would be inadvisable as well as impractical. It might eliminate variations in the 
number of omissions and thus wipe out some of the effects of differences in per- 
sonality, but it would do this at a cost of antagonizing teachers and frustrating 
students. It would also introduce additional chance variance into the scores. 


However, Votaw,²² Lentz,²³ Cronbach,²⁴ and others have found some
evidence of a factor of “acquiescence” operating in taking tests, since do- 
not-guess instructions placed ascendant students at an advantage over 
submissive students. It is argued that such instructions reduce the validity 
of achievement tests, since they become in some degree measures of per- 
sonality traits. Some investigators have also found that good students tend 
to improve their scores when they attempt doubtful items, whereas poor 
students do not. 

All things considered, the authors offer the following recommendations: 


a. The use of multiple-choice tests with fewer than four responses to each 
item should be avoided wherever possible. 


20 This formula is discussed on pages 156-158. 

21 Frederick B. Davis, “Item Selection Techniques,” in E. F. Lindquist (Editor),
Educational Measurement, pages 274-275. Washington, D. C.: American Council on
Education, 1951.

22 David F. Votaw, “The Effect of Do-not-guess Directions upon the Validity of
True-false or Multiple-choice Tests,” Journal of Educational Psychology, 27: 698-703,
December, 1936.

23 Theodore F. Lentz, “Acquiescence as a Factor in the Measurement of Personality,”
Psychological Bulletin, 35: 659, November, 1938.

24 Lee J. Cronbach, “Studies of Acquiescence as a Factor in the True-false Test,”
Journal of Educational Psychology, 33: 401-415, September, 1942; “Further Evidence
on Response Sets and Test Design,” Educational and Psychological Measurement, 10:
3-31, Spring, 1950.


b. Regardless of the number of possible responses, the score should prob- 
ably be the number right on all tests used with pupils below the junior- 
high-school level.²⁵

c. When tests with only two or three possible responses to each item are 
used with pupils above the sixth grade, the correction formula should be 
employed. 

d. Whenever the correction formula is to be used, the pupils should be 
so informed. 

e. The reason for the correction formula should be discussed with pupils 
before the test begins. They should then be allowed to use their best judg- 
ment, without being specifically advised by the directions to guess or not 
to guess. 


C. Trying Out the Test 


After the test has been prepared according to plan, it is ready to be given 
a trial in actual use. Since it is impossible in advance to know exactly how 
good the test is or to locate all the poor items, the tryout should be con- 
sidered a necessary step in constructing the test in its final form. The fol- 
lowing four principles should govern the tryout. With the possible excep- 
tion of the second, these principles are equally applicable to the later use 
of the test in its final form. 

1. Every reasonable precaution should be taken to insure normal conditions 
for the test. This is important because the responses to any test are partly 
determined by the conditions under which it is given, as well as by the test 
itself. It is usually well to have the test administered to the pupils in the 
familiar environment of their own classroom. Any tendency to cheat should 
be forestalled by careful supervision. Where cheating is likely to be a special 
problem, pupils may be so seated that every other seat is vacant, or the test 
items may be arranged in different orders for pupils seated close together. 

2. The time allowance for the test should be generous. This is more impor- 
tant in the tryout than in the later use of the test in its final form. One 
reason for this is that the items are arranged at best in only a rough order 
of difficulty, and, if the time allowance is too short, pupils may not have 
time to try items toward the end of the test, which they may be capable of 
answering correctly. Short time allowances should be avoided, therefore, 
in order to secure the data needed for determining the difficulty and the 
discriminating value of the items. What time allowance is to be considered 
generous will depend upon the purpose of the test and upon the ability and 
experience of the pupils. For example, it is obvious that the time limits of
speed tests should be so short that even the best pupil does not have time 
to finish the test. On the other hand, more time should be allowed for diagnostic
tests than for tests of general achievement; and tests of a purely
factual character can be answered more quickly than those involving the
higher mental processes.

25 Admittedly this decision is based on practical considerations, rather than on
experimental evidence. It is usually difficult to explain to young children the logic of the
formula, and they are likely to be suspicious of what they do not understand.

Lindquist suggests that, in general achievement tests, the time allowance 
should be so adjusted that “at least 75 per cent of the pupils will have time
at least to consider all items in each section.”²⁶ Ruch seemed to favor time
limits “so that 90 per cent can attempt all items within their power.”²⁷
In accordance with this standard, Ruch suggested that for fairly short items
of a factual character, three recall or four recognition items per minute is
a “reasonable expectancy for upper-elementary and high-school pupils.”
For reasoning tests the corresponding time allotments would be increased 
for recall items to one or two items per minute, and for multiple-choice 
items to two or three items per minute. Younger pupils and longer or 
harder items would demand still more time. 

The above standards have in mind the requirements for the ordinary use 
of the test in its final form, rather than for the tryout, for which more time 
should be allowed. Since so many factors influence the time demands of a 
particular test, the writers suggest that in the tryout sufficient time be
allowed so that all, or almost all, the pupils have time to finish. If the ex- 
aminer will record during the progress of the test the percentage of the 
pupils who are still at work after various amounts of time have elapsed, 
this information will be useful in determining the time allowances for later 
revisions of the test. 
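
The arithmetic behind these suggestions can be sketched briefly. The rates are Ruch's figures for fairly short factual items as quoted above; the function names, the rounding up to whole minutes, and the use of "no longer at work" as a stand-in for "had time to consider all items" are assumptions of this sketch, and the tryout record in the example is invented.

```python
import math

# Ruch's suggested rates for fairly short factual items (items per minute).
FACTUAL_RATES = {"recall": 3, "recognition": 4}


def estimated_minutes(item_counts, rates=FACTUAL_RATES):
    """Estimate working time for a test, rounded up to a whole minute."""
    total = sum(count / rates[kind] for kind, count in item_counts.items())
    return math.ceil(total)


def earliest_time_meeting(standard_pct, still_working):
    """Earliest recorded time at which standard_pct of the pupils had finished.

    still_working is a list of (minutes_elapsed, per cent still at work),
    as recorded by the examiner during the tryout. Returns None if the
    standard was never met in the record.
    """
    for minutes, pct_working in sorted(still_working):
        if 100 - pct_working >= standard_pct:
            return minutes
    return None


print(estimated_minutes({"recall": 30, "recognition": 40}))

# Hypothetical tryout record: (minutes elapsed, per cent still at work).
record = [(10, 80), (20, 40), (30, 20), (40, 5)]
print(earliest_time_meeting(75, record))  # Lindquist's 75 per cent standard
print(earliest_time_meeting(90, record))  # Ruch's 90 per cent standard
```

For reasoning items, the slower rates given in the text would be passed in place of `FACTUAL_RATES`.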

3. The scoring procedure adopted should be fairly simple. As a rule, the best 
procedure in scoring objective tests is to give one point of credit for each 
correct response. In multiple-choice tests this means one point for each item 
properly marked, and in recall tests it means one point for each blank cor- 
rectly filled. It is unnecessary to weight the items according to estimated 
difficulty or importance. Even in essay examinations weighting is much less 
important than is ordinarily assumed. Almost all pupils will be in the same 
rank order regardless of the weighting of the individual items. 

The correction-for-chance formula actually corrects for omissions rather 
than for guessing alone. If no items are omitted by the students, or if they 
all omit the same number of items, their relative scores will be the same 
regardless of whether or not it is employed. The general formula is usually 
written: 

$$S = R - \frac{W}{O - 1}$$

In this formula
S is the score corrected for guessing.
R is the number of right responses.
W is the number of wrong responses, not counting omitted items.
O is the number of options presented for each item.

For two-option or true-false items this becomes

$$S = R - W.$$

For three-option items the formula is

$$S = R - \tfrac{1}{2}W.$$

For four-option items the formula is

$$S = R - \tfrac{1}{3}W.$$

For five-option items it is

$$S = R - \tfrac{1}{4}W.$$


26 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, op. cit., page 116.

27 G. M. Ruch, The Objective or New-Type Examination, page 312. Chicago: Scott,
Foresman & Company, 1929.

28 Cf. Alexander J. Phillips, “Further Evidence Regarding Weighted Versus Unweighted
Scoring of Examinations,” Educational and Psychological Measurement, 3:
151-155, Summer, 1943.


If the items have six or more options, it is probably not worth while to 
correct the “rights” scores for chance. 

The above formulas are designed to reduce to zero the score of each 
person who is totally ignorant concerning the material as presented in the 
test and who guesses at the right answers with a degree of success depend- 
ent upon the number of options each item has. For example, if the test 
contains 100 true-false items and if the testee guesses at each of these, 
he should on the average answer about 50 items “correctly,” since there is 
one chance in two that he will mark any item right purely by accident. 
Thus the expected score for a wholly uninformed person would be 50 right 
and 50 wrong. However, he richly deserves a final score of zero, which 
represents his knowledge of the material covered, so from the 50 rights are 
subtracted the 50 wrongs: R — W = 50 — 50 = 0. 
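
The correction is easy to express in code. The sketch below assumes the general form S = R - W/(O - 1), with omitted items counted neither as right nor as wrong; the function name `corrected_score` and the use of exact fractions are choices of this illustration rather than anything prescribed by the text.

```python
from fractions import Fraction


def corrected_score(rights, wrongs, options):
    """Score corrected for chance: S = R - W / (O - 1).

    Omitted items are counted in neither rights nor wrongs.
    """
    if options < 2:
        raise ValueError("each item needs at least two options")
    return Fraction(rights) - Fraction(wrongs, options - 1)


# 100 true-false items answered by pure guessing: about 50 right, 50 wrong.
print(corrected_score(50, 50, 2))

# 100 five-option items answered by pure guessing: about 20 right, 80 wrong.
print(corrected_score(20, 80, 5))

# A pupil who marks 50 items correctly and omits the rest loses nothing.
print(corrected_score(50, 0, 2))
```

Exact fractions are used so that three- and four-option tests, where the deduction is one half or one third of a point per wrong answer, do not pick up rounding error.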

On the other hand, if he answers 50 items correctly and omits the 
other 50, his score will be 50 — 0 = 50. Presumably, had he tried the 50 
omitted items he would have answered half of them (25) correctly by 
chance and missed the other 25, making his rights score 50 known + 25 
guessed = 75 and his wrongs score 25. 75 — 25 = 50, the same score he 
would have secured without any guessing. There is a fallacy in this argu- 
ment, though, for a student is not likely to know the answers to half the 
questions and be absolutely ignorant concerning the other half. More 
likely, he has various degrees of partial information and misinformation 
concerning a considerable number of items. In many test situations these 
two types of information seem to cancel each other out, thereby making
the $R - \frac{W}{O - 1}$ formula suitable.²⁹

29 Frederick B. Davis, op. cit., page 211.


It should be emphasized that, strictly speaking, the correction formula 
is needed only when some students have omitted a fairly large number of 
items, while others have omitted few. Otherwise, the ranks of the students 
will be unchanged, regardless of whether or not their scores are corrected 
for "chance." For psychological reasons the teacher may report corrected 
scores to the students, even though few items have been omitted. This is 
especially advisable with true-false tests, where the poorest students may 
not realize the extent of their ignorance and hence may protest if given low 
grades, unless the R — W formula is employed. 

If all individuals tested answer every item, the standard deviation of 


D = 1 times the standard deviation of the 
rights scores. From this relationship it is apparent that correcting total 
scores from a “‘do-guess” true-false test doubles their standard deviation: 
О + (0 — 1) = 2 divided by (2 — 1) = 2. Likewise, if there are no omits, 
the standard deviation of corrected-for-chance five-option-item test scores 
is 4 = 1$ times the standard deviation of the uncorrected scores. 

The authors recommend using the correction formula with true-false and 
two- and three-option multiple-choice test scores, even when omissions are 
negligible, in order to emphasize the range of knowledge within the group 
tested. 

4. Before the actual scoring begins, answer keys and scoring rules should be 
prepared. In teacher-made objective tests satisfactory scoring keys can be 
prepared by simply filling in the correct responses, preferably with a colored 
pencil, on one of the unused tests. Scoring then consists of comparing the 
pupil's responses with those on the key placed beside his paper. In essay 
examinations the key consists of a model paper containing a complete set 
of answers, together with the points to be allowed on each. Definite rules 
are necessary to secure uniformity in scoring. The rules for scoring objective 
tests usually say merely that one point will be allowed for each correct 
response and that no fractional credits will be allowed, and indicate whether 
or not the correction formula will be used. The rules for essay examinations 
give the weight for each question, and tell whether or not any deductions 
are made for errors in spelling, language usage, and so forth. In mathe- 
maties tests the rules should cover such points as whether or not the 
answers must be reduced to lowest terms, whether or not credit will be 
allowed for solutions correct in principle but with the wrong answer, and 
the like. 

If the students’ answers have been recorded on a special answer sheet, 
such as the International Business Machines general-purpose answer sheet, 
then a punched-out cardboard key will speed up scoring a great deal. One of 
the most useful of the various answer sheets is shown in Figure 5.³⁰


30 Answer sheets and the key-punch may be purchased from International Business
Machines Corporation, Endicott, New York.




D. Evaluating the Test 


After the papers have been scored, the results should be interpreted and 
evaluated from two points of view: first, as to the quality of the test itself; 
and, second, as to the quality of the pupils’ responses. While the ultimate 
interest of the test maker is in the light thrown by the test results upon the 
quality of the teaching and organization that exists in the school, his first 


Figure 5. IBM General-Purpose Answer Sheet.




concern should be the quality of the test used. Only tests of high merit 
afford suitable information regarding the school situation. To what extent, 
then, does the test possess the three characteristics of satisfactory measuring
instruments: validity, reliability, and usability? Only the last of these
can be confidently determined in advance. The five principles that follow
are suggested for evaluating the test from the viewpoint of its validity and
reliability.³¹ If the test is found to possess these qualities in high degree,
the scores should then be carefully analyzed for their value in instruction
and school administration. If the test is found to lack these qualities, the
scores can be disregarded and the test subjected to a thorough revision.
No matter how carefully the test is prepared in the first place, its merits
should be established and not merely assumed.

1. The difficulty of the test is a rough indication of its adequacy. The diffi- 
culty of the test as a whole is determined by finding what percentage the 
average score made (corrected for chance, if appropriate) is of the maximum 
possible score. In general achievement tests, the nearer this average is to 
50 per cent, the better. The difficulty of the individual test items is obtained 
by finding the percentage of successful responses for each item, usually corrected
for “chance” in the same manner as the total score.³² Items answered
by 100 per cent or by 0 per cent of the pupils are of no value in a test of 
general achievement. The difficulty of the test is relatively unimportant 
in mastery tests and in diagnostic tests. 
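
The two difficulty indices described above can be computed as in the following Python sketch. It assumes the chance-corrected item formula p = [R − W/(O − 1)]/N given in the accompanying footnote; the function names and the sample figures are hypothetical and serve only to show the arithmetic.

```python
def whole_test_difficulty(mean_score, max_score):
    """Average score expressed as a percentage of the maximum possible score."""
    return 100.0 * mean_score / max_score


def item_difficulty(n_right, n_wrong, n_options, n_testees):
    """Proportion of testees who 'know' the answer, corrected for chance:
    (R - W / (O - 1)) / N."""
    return (n_right - n_wrong / (n_options - 1)) / n_testees


# Hypothetical class of 40 pupils on a 60-item test of five-option items.
print(whole_test_difficulty(mean_score=31.0, max_score=60))   # about 51.7 per cent
print(item_difficulty(n_right=28, n_wrong=12, n_options=5, n_testees=40))
```

In this illustration 70 per cent of the pupils marked the item correctly, but the corrected index credits only 62.5 per cent with knowing the answer, since some correct marks are attributed to guessing. As the footnote cautions, the item formula applies only when nearly all pupils have had time to try the item.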

2. The internal consistency of the individual items in the test is determined 
by their ability to discriminate between pupils who rank high and those who
rank low on the test as a whole. There are several methods of determining 
this. Only the simplest of these methods are practical for use with informal 
tests. A satisfactory procedure for the classroom teacher is to determine 
the number of correct responses (or of incorrect responses) to each test item 
by the pupils who rank in the highest 27 per cent of the class on the test 
as a whole, and to compare this with the corresponding number in the 


31 For a brief but suggestive report, see: Ellis Weitzman and Walter J. McNamara,
“Techniques Used in Analyzing the Learning Achievement of Naval Aviation Cadets," 
Journal of Educational Psychology, 35: 181-185, March, 1944. 

32 Proportion of testees who “know” the answer to an item =

    (number who marked it correctly − (number who marked it incorrectly ÷
    number of options item has minus one))

divided by the total number of testees. In symbols,

    p = [R − W/(O − 1)] / N

This formula is appropriate only when nearly all examinees have had time enough to 
try the item. Otherwise, see Frederick B. Davis, “Item Selection Techniques,” pages 
278-280 in E. F. Lindquist (Editor), Educational Measurement. Washington, D. C.: 
American Council on Education, 1951. 




lowest 27 per cent of the class.³³ The items in which the number of correct
responses of the high group exceeds that of the low group by the largest 
amount are best; those in which numbers are the same are useless; and 
those in which the number of correct responses of the high group falls 
behind that of the low group are detrimental. Items showing zero or nega- 
tive discrimination should be either reworded or thrown out altogether. 
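
The high-group and low-group comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the scores are invented, the function names are hypothetical, and pupils tied at the boundary of a group are simply taken in the order listed.

```python
def upper_lower_counts(total_scores, item_correct, fraction=0.27):
    """Count correct responses to one item in the highest and lowest
    `fraction` of the class, ranked on the test as a whole."""
    n = len(total_scores)
    k = max(1, round(n * fraction))            # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    high, low = order[:k], order[-k:]
    return (sum(item_correct[i] for i in high),
            sum(item_correct[i] for i in low))


def verdict(high_right, low_right):
    """Classify the item as the text does: best, useless, or detrimental."""
    if high_right > low_right:
        return "discriminates"
    if high_right == low_right:
        return "useless"
    return "detrimental"


# Hypothetical class of 11 pupils: total scores and 1/0 marks on one item.
totals = [48, 45, 44, 40, 38, 35, 33, 30, 27, 22, 20]
item = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1]

high_right, low_right = upper_lower_counts(totals, item)
print(high_right, low_right, verdict(high_right, low_right))
```

With 11 pupils each extreme group contains three pupils. The larger the excess of the high group over the low group, the better the item discriminates.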

3. It is a good practice to have the items interpreted or criticized by persons 
who have taken the test. It is impossible to anticipate fully all the mental 
processes pupils will employ in responding to a test item. These can be 
determined only by making inquiry of pupils who have taken the test. 
In this way irrelevancies and ambiguities will be revealed that were wholly 
unsuspected by the maker of the test. Often a slight change in wording is 
sufficient to remedy the difficulty. At other times the item must be entirely 
discarded. If a test contains too many of these items, the scores on the test 
should not be counted in determining the pupil’s record in the class. Invit- 
ing members of the class to assist in this critical evaluation of the test may 
help to create a favorable attitude toward the measurement process em- 
ployed by the instructor, and is a valuable educational experience in itself. 

4. Whenever possible, the results on the test should be checked against an 
outside criterion. For short tests covering small units of subject matter, 
this process is likely to be difficult and of little value. Even here it is some- 
times helpful to compare the ranks of the pupils on the test with those 
assigned by the teacher before the test is given. The validity of the longer 
and more important tests can be determined in a more satisfactory manner 
by comparing the scores of the pupils on each test with their scores on a 
good standard test covering the same material and given at about the same 
time. The coefficient of correlation obtained between the two series of 
scores is the most exact method of expressing the amount of agreement, 
although a rough indication can be obtained by comparing the percentage 
of scores which lie in the same fourths of the two series of scores. 
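
Both ways of expressing agreement with an outside criterion can be sketched in Python. The example below computes the product-moment coefficient of correlation and, as the rougher index, the percentage of pupils who fall in the same fourth of both series. The scores and function names are hypothetical, and ties in rank are broken arbitrarily by list order, which is an assumption of this sketch rather than a rule from the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fourths(scores):
    """Assign each pupil to a fourth (0 = lowest, 3 = highest) by rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    quarter = [0] * n
    for rank, i in enumerate(order):
        quarter[i] = rank * 4 // n
    return quarter


def same_fourth_percentage(x, y):
    """Percentage of pupils lying in the same fourth of both series."""
    same = sum(a == b for a, b in zip(fourths(x), fourths(y)))
    return 100.0 * same / len(x)


# Hypothetical scores of eight pupils on a teacher-made and a standard test.
teacher_test = [12, 15, 19, 22, 25, 28, 31, 36]
standard_test = [30, 41, 38, 45, 52, 50, 63, 60]

print(round(pearson_r(teacher_test, standard_test), 2))
print(same_fourth_percentage(teacher_test, standard_test))
```

The correlation uses every score, whereas the same-fourths percentage discards information within each fourth; this is why the text calls the latter only a rough indication.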

5. It is sometimes desirable to obtain the reliability coefficient of the test.
The authors recognize that it is easy to overestimate the value of the reli- 
ability coefficient. The makers of standardized tests have often made this 
mistake. However, the reliability coefficient does have some merit in eval- 
uating informal tests, although the value is mainly negative. Low reliability 
coefficients indicate tests of doubtful merit, but high reliability coefficients 
per se do not establish the value of the tests. To be of real value these coefficients
must be supported by other criteria.³⁵
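
The text does not state here how a one-form reliability coefficient is to be computed. One widely used approach, adopted in the Python sketch below purely as an assumption, is to correlate scores on the odd-numbered items with scores on the even-numbered items and step the result up by the Spearman-Brown formula. It is not claimed to be the simplified method of Appendix B, and the item marks and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment coefficient of correlation between two score series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_marks):
    """item_marks: one row per pupil, one 1/0 mark per item."""
    odd = [sum(row[0::2]) for row in item_marks]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_marks]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical marks of four pupils on a four-item test.
marks = [[1, 1, 1, 1],
         [1, 1, 0, 0],
         [0, 0, 0, 0],
         [1, 0, 1, 0]]
print(split_half_reliability(marks))
```

A real classroom test would have many more items and pupils; with so few, the coefficient is unstable. In keeping with the caution above, a high value obtained this way would not by itself establish the merit of the test.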

33 The reasons for selecting contrasting groups from the 27 per cent at the extremes
of the distribution are given by Truman L. Kelley, “The Selection of Upper and Lower
Groups for the Validation of Test Items,” Journal of Educational Psychology, 30: 17-24,
January, 1939.

34 An item analysis of a classroom test is set forth in Appendix B, pages
436-458.

35 In Appendix B, page 452, a simplified method for securing a one-form reliability
coefficient is applied to a typical classroom examination.




The construction of an informal teacher-made test, then, involves these 
four steps: planning, preparing, trying out, and evaluating. It is perhaps 
more correct to say that these activities constitute a cycle in the construc- 
tion of a test, for it is often necessary to repeat these steps, particularly the 
last three, several times before the test is brought to its finished form. 


SELECTED REFERENCES FOR FURTHER READING 


Adkins, Dorothy C., and others, Construction and Analysis of Achievement Tests. 
Washington, D. C.: U. S. Government Printing Office, 1947. 292 pages. 

Cronbach, Lee J., Essentials of Psychological Testing. New York: Harper & Brothers, 
1949. 475 pages. 

Davis, Frederick B., “The AAF Qualifying Examination," Army Air Forces Aviation
Psychology Research Report No. 6. Washington, D. C.: U. S. Government
Printing Office, 1947. 266 pages.

Gardner, Eric F., “Development and Applications of Tests of Educational Achieve- 
ment in Schools and Colleges," Review of Educational Research, 23: 85-101, 
February, 1953. 

Goheen, Howard W., and Kavruck, Samuel, Selected References on Test Construction, 
Mental Test Theory, and Statistics, 1929-1949. Washington, D. C.: U. S. Govern- 
ment Printing Office, 1950. 209 pages. 

Goodenough, Florence L., Mental Testing: Its History, Principles, and Applications. 
New York: Rinehart & Company, 1949. Chapter 8, “The Analysis and Selection 
of Test Items." 

Jordan, A. M., Measurement in Education: An Introduction. New York: McGraw- 
Hill Book Company, 1953. Chapter 2, “Characteristics of Measuring Instru- 
ments." 

Lindquist, E. F., “Preliminary Considerations in Objective Test Construction,” 
Chapter 5 in E. F. Lindquist (Editor), Educational Measurement. Washington, 
D. C.: American Council on Education, 1951. 

Odell, C. W., How to Improve Classroom Testing. Dubuque, Iowa: William C. Brown
Company, 1953. Chapter IV, “Test Construction: General." 

Stanley, Julian C., “A Simplified Item-Analysis Procedure," American Psychologist,
6: 369, July, 1951.

Stanley, Julian C., “A Simplified Method for Estimating the Split-Half Reliability
Coefficient of a Test," Harvard Educational Review, 21: 221-224, Fall, 1951.

Stanley, Julian C., “‘Psychological’ Correction for Chance,” Journal of Experimental
Education, 22: 297-298, March, 1954.

Travers, Robert M. W., How to Make Achievement Tests. New York: Odyssey Press, 
1950. 180 pages. 

Travers, Robert M. W., “Rational Hypotheses in the Construction of Tests,” 
Educational and Psychological Measurement, 11: 128-137, Spring, 1951. 

Traxler, Arthur E.; Jacobs, Robert; Selover, Margaret; and Townsend, Agatha, 
Introduction to Testing and the Use of Test Results in Public Schools. New York: 
Harper & Brothers, 1953, Chapter 2, “What Do Tests Contribute to Understand- 
ing the Individual Pupil?” 

Weitzman, Ellis, and McNamara, Walter J., Constructing Classroom Examinations 
—A Guide for Teachers. Chicago: Science Research Associates, 1949. Chapters 1
and 2, “Basic Aspects of Achievement Tests" and “Steps in Classroom Testing.” 


6 


Principles of Constructing Specific Types 
of Objective Tests 


A. Introduction 


Types of objective tests. The principal types of objective test items 
used by classroom teachers may be listed as follows: 


1. Recall types. 
a. Simple-recall. 
b. Completion. 


2. Recognition types. 

a. More common. 
(1) Alternative-response. 
(2) Multiple-choice. 
(3) Matching. 

b. Less common. 
(1) Rearrangement. 
(2) Identification. 
(3) Analogy. 
(4) Incorrect statement. 

This chapter will consider the uses and limitations of the commonly used 
forms of objective tests and suggest rules which have been found to be of 
value in constructing them. It will also give illustrative items in a variety 
of fields, drawn mainly from standard tests. 

Frequency of use by teachers. Two early studies present data on the 
frequency of use by classroom teachers of various forms of test items. In the 
first of these studies Conneau¹ analyzed 45,418 test items that appeared

1 Summarized by G. M. Ruch, The Objective or New-Type Examination, pages 188-190.
Chicago: Scott, Foresman & Company, 1929.




in 375 objective examinations submitted in a prize contest. This study 
doubtless represented the practice of superior teachers in 1928, rather than 
that of average teachers. In 1936 Lee and Segel² reported an analysis of the
types of informal tests used by 1,600 high-school teachers distributed 
widely over the United States. That there is rather surprising agreement 
between these studies is indicated by Table 25. In both studies the comple- 


TABLE 25


RANKINGS OF TEST ITEMS ACCORDING TO FREQUENCY OF USE AS REVEALED
BY TWO STUDIES


Type of Item          Conneau     Lee and Segel
Completion               1              1
True-false               2              2
Multiple-choice          3              4
Essay                   11              5
Problem                  7              6
Matching                 4              7


tion form ranks first and the true-false second. Conneau grouped all recall 
forms under completion, while Lee and Segel separated out the one-word- 
answer type. This type of item, which ranked third in the latter study, 
has not been included in the table. The next most popular item is the 
multiple-choice form. The most striking disagreement is in the relative rank 
of the essay examination. In the earlier study only 0.6 per cent of the ques- 
tions were of the essay type, while in the more recent study 16 per cent 
of the teachers appear to be using that type extensively. This apparent 
revival of interest in the essay examination is probably less marked than 
the difference in ranks between the two studies would indicate, since the 
earlier tests were written for prize competition. In fact, Lee and Segel³
conclude that there was a definite shift toward objective tests. 

Davis and Hensley⁴ report that “[high school] teachers prefer a combination
of essay and objective questions. Fifty-nine per cent of the teachers
use a combination of question types. Of the total group four per cent use 
the essay type question exclusively and thirty-eight per cent use objective 
examinations only.” 

Comparative validity and reliability of various types of tests. 
Ruch⁵ summarized the experimental studies available in 1929 and came


2 J. Murray Lee and David Segel, Testing Practices of High School Teachers, pages
6-12. United States Office of Education Bulletin, No. 9, 1936.

3 J. Murray Lee and David Segel, op. cit., page 6.

4 Iven H. Hensley and Robert A. Davis, “What High-School Teachers Think and Do
About Their Examinations,” Educational Administration and Supervision, 38: 219-228,
April, 1952. Page 220.

5 G. M. Ruch, op. cit., pages 281-306.


SPECIFIC TYPES OF OBJECTIVE TESTS 165 


to the conclusion that “the new-type tests are at least as valid as the essay 
examinations," and that the various objective types are “not greatly un- 
equal in validity." Ruch also concluded that “for equal working times 
recall and recognition types are not greatly dissimilar,” although recall 
tests tended to rank at the top and true-false at the bottom in most of 
the studies. 

During the next ten years several experimental studies and excellent 
summaries of the literature were published. Those by Kinney and Eurich⁶
and by Lee and Symonds⁷ were the most comprehensive. The latter study
points out that the problem of determination of the comparative merits of
different measuring instruments is not only “one of the most important,”
it is also “one of the most poorly done.”

Rinsland⁸ also summarized the experimental literature to 1938 and suggested
two cautious conclusions:


1. One might conclude that the objective tests, with probably the exception of 
the true-false type, are as valid as, or perhaps slightly more valid than, the essay 
or subjective examination; and that, of all the objective forms, the completion or
simple recall seems to be the most valid. 

2. Generally speaking, the various types of objective tests have about equal 
reliability when compared on the basis of working time. Differences of reliability
may be due primarily to the wording of individual items rather than to the objective 
form. 


Lindquist⁹ takes the position that many of the studies which have attempted
to determine the comparative validities and reliabilities of various
test forms have been “inconclusive, if not definitely misleading.” He points 
out that these comparative studies have not always recognized the specific 
nature of test validity, have overemphasized the importance of reliability, 
and have often failed to control such factors as relative skill in constructing 
the various test forms and the time allotments for the tests. In view of 
these limitations Lindquist comes to the conclusion that “in making a selec- 
tion from a number of test techniques in any specific test situation or in 
relation to any specific objective of instruction, the test constructor must, 
at present, depend almost entirely upon logical considerations rather than 
upon the experimental or empirical evidence that is now available.” 

6 L. B. Kinney and A. C. Eurich, “A Summary of Investigations Comparing Different
Types of Tests,” School and Society, 36: 540-544, October 22, 1932.

7 J. Murray Lee and Percival M. Symonds, “New Type of Objective Tests: A Summary
of Recent Investigations," Journal of Educational Psychology, 24: 21-39, February,
1933; 25: 161-184, March, 1934.

8 Henry Daniel Rinsland, Constructing Tests and Grading in Elementary and High
School Subjects, pages 295-299. New York: Prentice-Hall, Inc., 1938.

9 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, The Construction and Use of
Achievement Examinations, pages 97-103. Boston: Houghton Mifflin Company, 1936.

10 Max D. Engelhart, "Examinations," in the Encyclopedia of Educational Research,
edited by Walter S. Monroe, pages 407-412. New York: The Macmillan Company,
1950.


A summary published in 1950¹⁰ concludes that “few dependable generalizations
can be drawn from studies in this area" and that the “overlappings
are much more significant than minor differences between averages."
Another article in the same volume¹¹ arrives at this conclusion:


Adequate comparisons between test techniques can therefore be made only for 
specific material when results are used for a specific purpose, when items are con- 
structed with specific insight and ability, and on the basis of validity coefficients 
computed for equal amounts of testing time when each test is administered at its 
optimum rate. 


To be of practical guidance to the classroom teacher, research should 
seek answers to such specific questions as the following: In the measure- 
ment of what specific objectives in science is the true-false technique of 
most worth? What testing technique is most effective for measuring vocab- 
ulary in a foreign language? What distinctive value, if any, has the rear- 
rangement test in history? For the present, one’s choice of the tools of 
science must depend chiefly upon one’s personal judgment and general 
educational philosophy rather than upon direct experimental evidence. 

It is well to recognize that knowledge may exist and function on at least 
four different levels. The lowest level involves mere recognition. A person's
general reading vocabulary, as distinguished from his speaking and writing 
vocabulary, is an example of knowledge where the ability to recognize is 
the important thing. The next higher level involves recall. For knowledge 
of many types to have value, one must be able to recall it when needed. 
Familiar examples are one’s speaking and writing vocabulary, the names 
and faces of acquaintances, and the ordinary number combinations in 
arithmetic. Sometimes one needs to recall separate facts or isolated bits 
of knowledge, but at other times the organization is important. The per- 
son who is an entertaining conversationalist, an interesting letter writer, 
or an effective public speaker must be able to present his knowledge in a 
connected form. A still higher level of knowledge involves the ability to 
interpret and evaluate. At this level the learner must have a sufficient under- 
standing of the material to be able to see it in its relationships to other 
things. The exercise of discrimination and judgment is implied. The highest 
level of all involves application. The person who is able to utilize informa- 
tion acquired in one situation and who applies it to the intelligent solution 
of problems in a new setting has arrived at true mastery. 

It seems reasonable to assume that the type of test used must be appro- 
priate to the level of knowledge being measured. Tests of the multiple- 
choice and matching types appear adequate for the first level of knowledge. 


11 Walter W. Cook, “Achievement Tests,” in the 1950 Encyclopedia of Educational
Research, page 1468. 




Recall tests may be required for the other three levels. Wherever organiza- 
tion is important, the essay type is perhaps more appropriate than the 
simple recall. However, far more important than the type of test is the skill 
with which it is used. Understanding, evaluation, application, and many 
other aspects of thinking can be measured by recognition tests, but to do 
so requires a degree of skill that the regular classroom teacher rarely attains.¹²
It is also probably true that most recall tests measure memory only.


B. Simple-Recall Tests 


Definition. The simple-recall test is here somewhat arbitrarily defined 
as one in which each item appears as a direct question, a stimulus word or 
phrase, or a specific direction. The response must be recalled by the pupil 
from his past experience rather than merely identified from a list of sug- 
gested answers supplied by the teacher. The simple-recall test is differen- 
tiated from the essay examination primarily upon the basis of length of 
response required; the typical response to the simple-recall item is short, 
preferably a single word or phrase. Thus it is sometimes called a short- 
answer objective test. 

Advantages and limitations. This type of test has the obvious ad- 
vantage of familiarity and “naturalness.” It may stimulate desirable study 
practices and almost completely eliminate guessing as a factor for measure- 
ment, thus avoiding two of the most common faults of objective tests. The 
simple-recall test is particularly valuable in mathematics and the physical 
sciences, where the stimulus appears in the form of a problem requiring 
computation. It also has wider application to test situations presented in 
the form of maps, charts, and diagrams in which the pupil is required to 
supply, in spaces provided, the names of parts keyed by numbers or letters. 

One limitation of the simple-recall test is that it tends to measure highly 
factual knowledge, consisting of isolated bits of information. Also the scor- 
ing is somewhat laborious and not always entirely objective. These limita- 
tions need not be very serious when the tests are carefully prepared, as can 
be seen from the illustrations which follow. 


I. Illustrations of Simple-Recall Tests 


Below are a few sample test items of the simple-recall form that have
been taken from standard tests.¹³ Excellent examples of this and other test
forms used in a variety of school subjects on all educational levels are to be
found in Rinsland.¹⁴


12 For a comprehensive discussion of this problem, see: William A. Brownell and
Committee, "The Measurement of Understanding," Forty-Fifth Yearbook of the National
Society for the Study of Education, Part I. 338 pages. Chicago: University of
Chicago Press, 1946.

13 In the examples of the various types of objective tests that follow an effort has
been made to illustrate a wide variety of mechanical arrangements of items as well as
of subject matter. It is recognized that they are not all of equal merit. Some of the tests
referred to are perhaps out of print.

Stone Reasoning Tests in Arithmetic¹⁵

1. James had 5 cents. He earned 13 cents more and then
   bought a top for 10 cents. How much money did he have
   left?                                                   Answer: ______
2. How many oranges can I buy for 35 cents when oranges
   cost 7 cents each?                                      Answer: ______

Sones-Harry High School Achievement Test, Part II¹⁶

1. What instrument was designed to draw a circle? ............ (      )
2. Write “25% of” as “a decimal times.” ...................... (      )
3. Write in figures: one thousand seven and four hundredths .. (      )

Cooperative General Mathematics Tests for College Students, Form 1934¹⁷

28. How many axes of symmetry does an equilateral triangle
    have? .................................................... (      )
29. Eight is what per cent of 64? ............................ (      )
30. Write an expression that exceeds M by X. ................. (      )
31. Solve the formula V = [remainder illegible] .............. (      )

Iowa Placement Examinations, Chemistry-Training¹⁸

1. The atomic weight of K is 39; of Cl, 35.5; of O, 16.
   What is the molecular weight of KClO₃?                    ______
2. If 7 gm. of iron unite with 4 gm. of sulphur, how many gm.
   of iron sulphide will be produced?                        ______

Tests on Everyday Problems in Science, Unit XII¹⁹

What device is used in a vacuum-cleaner to pump air into
the dust bag?                                                ______
What is the pressure in pounds of ordinary air per square
inch?                                                        ______

An Exercise from a Biology Workbook²⁰

Directions: As you locate each part using a hand lens on an actual specimen,
find the corresponding part in the accompanying illustration and label it. Consider
how each part functions in the life of the grasshopper.


14 Henry Daniel Rinsland, op. cit., pages 23-222.

15 Devised by C. W. Stone, and published by Bureau of Publications, Teachers College,
Columbia University.

16 Devised by W. W. D. Sones and David P. Harry, Jr., and published by World
Book Company.

17 Devised by H. T. Lundholm and L. P. Siceloff, and published by Cooperative
Test Service.

18 Devised by G. D. Stoddard and J. Cornog, and published by Extension Division,
State University of Iowa.

19 Devised by C. J. Pieper and W. L. Beauchamp, and published by Scott, Foresman
& Company.

20 Prepared by Arthur O. Baker and Lewis H. Mills to accompany their Dynamic
Biology Today, Chicago: Rand McNally & Company, 1943.


Parts of the Grasshopper 


The following items from an informal class test illustrate the possibilities 
of recall tests with more than one response to each item: 
For each event below give the country, year, and person with whom you
associate it:

Event                                     Country     Year     Person

First psychological laboratory ........   ______     ______   ______
First general intelligence test .......   ______     ______   ______
First standardized achievement test ...   ______     ______   ______


II. Rules and Suggestions for Construction 


The simple-recall is one of the most familiar test forms and one of the 
easiest to prepare. The main problem is how to phrase the test situations 
so that they will call forth responses of a higher intellectual level than mere 
rote memory, and so that they can be scored with a minimum expenditure 
of time and effort. 

1. The direct-question form is usually preferable to the statement form. It is more
natural for the pupil and is likely to be easier to phrase. 

EXAMPLE: The first president of the United States was ______
BETTER: Who was the first president of the United States? ______

2. The questions should be so worded that the response required is as brief as possible, 
preferably a single word, number, symbol, or at most a short phrase. This will objectify 
and facilitate scoring. 


3. The blanks provided for the responses should be in a column, preferably at the 
right of the questions. This arrangement facilitates scoring and is more convenient 
for the pupil. The illustrations above show various ways of arranging the answer 
column.




4. The use of textbook language in wording the question should be reduced to the 
minimum. Unfamiliar phrasing will reduce the possibility of correct responses that 
represent mere meaningless verbal associations, and also will eliminate the tempta- 
tion of pupils to memorize the exact language of the book. 


5. The questions should be so worded that there is only one correct response. This is 
a standard which is difficult to reach, since pupils are marvelously resourceful in 
reading into questions interpretations which the teacher never intended. For ex- 
ample, the question in ancient history, "Name two ancient sports" elicited this 
reply from an ingenious student, “Antony and Cleopatra." This possibility would 
not have arisen had the question taken this form: "What were two popular athletic
contests in ancient Greece?" All acceptable replies which are based on any legiti- 
mate interpretation of the question should receive credit, and must be listed on 
the scoring key. Extra care in wording the questions will save much time and 
trouble later. 
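
A scoring key that lists every acceptable reply, as rule 5 requires, can be sketched in Python as follows. The sketch assumes one point for each correct response and no fractional credit, and it ignores differences of capitalization and spacing; that tolerance, the function names, and the sample key are hypothetical choices made for illustration, not rules taken from the text.

```python
def normalize(reply):
    """Ignore capitalization and extra spaces when comparing replies."""
    return " ".join(reply.lower().split())


def score_simple_recall(responses, key):
    """One point for each item whose reply appears among the acceptable
    replies listed on the key; blank or missing replies earn nothing."""
    score = 0
    for item, acceptable in key.items():
        reply = normalize(responses.get(item, ""))
        if reply in {normalize(a) for a in acceptable}:
            score += 1
    return score


# Hypothetical key: each item lists all replies the teacher will credit.
key = {
    1: ["George Washington", "Washington"],   # first president
    2: ["5", "five"],                          # oranges bought for 35 cents
    3: ["8 cents", "8"],                       # money James had left
}

pupil = {1: "washington", 2: "Five", 3: "18 cents"}
print(score_simple_recall(pupil, key))
```

Listing the acceptable replies in advance keeps the scoring uniform from paper to paper, which is the purpose the rule serves.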


C. Completion Tests 


Definition. The completion test may be defined as a series of sentences 
in which certain important words or phrases have been omitted and blanks 
submitted for the pupil to fill in. A sentence may contain one or more
blanks. The sentences in the test may be disconnected, or they may be 
organized into a paragraph. Each blank counts one point. 

Advantages and limitations. The mental processes which the pupil 
must employ in supplying the responses required in completion tests are 
very similar to those required in simple recall tests, although perhaps on a 
somewhat higher level. It is not surprising that the advantages and limita- 
tions of these two types of tests are also similar. The completion test has 
wide applicability, as far as subject-matter is concerned, but unless prepared
with extreme care is likely to measure rote memory rather than real
understanding; or it may turn out to be more a measure of general intelligence
or linguistic aptitude than of school achievement.

The scoring is likely to be even more laborious than that of simple-recall 
tests. This is not only because the scoring is somewhat subjective, but also 
because the missing words are written in blanks scattered all over the page, 
rather than in a column. While these limitations cannot be entirely elimi-
nated, they can be greatly reduced, as is evident from the illustrations 
below. 


I. Illustrations of Completion Tests 


Stanford Achievement Test, Paragraph Meaning, 1940 Edition²¹


Directions: [Abridged] Write JUST ONE WORD on each line. Be sure to write 
each answer on the line that has the same number as the missing word in the paragraph. 


21 Devised by Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman, and pub-
lished by World Book Company. 




1-2-3                                                               Answer
In olden days men made their own pens from the quills
of feathers. It required considerable skill to cut a pen properly
so as to suit one's individual taste in writing. Students were
always on the lookout for good goose, swan, turkey, or other
bird feathers. Goose quills made the most satisfactory —1—         1 ______
for general —2—, but schoolmasters liked pens made from the         2 ______
—3— of swan feathers because they fitted best behind the ear.       3 ______


Public School Attainment Tests for High School Entrance²²


3. Question: Did this team have a coach?
   Answer: No, they taught (3) how to play without any
   coach.                                                      (3) ______
4. Q: Did all of you have matches?
   A: Of course! Each one had (4) own water-proof box full.    (4) ______


Tests of Everyday Problems in Science, Unit XI²³


A pry-pole is an example of a machine called the ......... ______
A capstan is an example of a machine called the .......... ______
A screw is an example of a machine called the ............ ______
Your teeth are examples of machines called ............... ______


Gregory Tests in American History²⁴


                                                        Write your words
                                                        and dates here.
2. The man who headed the first expedition to circumnavigate
   the globe was .....................................  2. ______
7. The Articles of Confederation were in force from
   1781 to ...........................................  7. ______
9. The “Old Liberty Bell” rang out the decision of
   Congress to be free from England in the year ......  9. ______


Cooperative English Test, Series 1²⁵


20. Write on the lines to the right the contractions—shortened forms to represent
how the words are naturally spoken—for the seven groups of words underlined
in the following sentences. For instance, for do not, you would write don’t. You
need not copy the sentence, but only the seven contractions.


I have read his story, but I cannot             ________  ________
believe that he will get a passing              ________
grade on it, for it is not well written         ________
and has not a clear-cut plot. The               ________
characters are not at all interesting; they are ________  ________
not even human.


22 Devised by Henry D. Rinsland and Roland L. Beck, and published by Public School
Publishing Company. (The form of completion with all responses in a column, instead
of staggered within sentences, was devised by Rinsland.)

23 Devised by C. J. Pieper and W. L. Beauchamp, and published by Scott, Foresman
& Company.

24 Devised by C. A. Gregory, and published by C. A. Gregory Company.

25 Devised by Sterling A. Leonard and others, and published by Cooperative Test
Service, 1934.




II. Rules and Suggestions for Construction 


Most of the suggestions made for constructing simple-recall tests apply 
equally well to completion tests. The dangers to be avoided are largely the 
same for both forms. A few suggestions may be offered, however, that have 
special reference to completion items. The main problems of constructing
completion tests are three in number: (1) How to phrase the statements
so as to indicate the type of response desired; (2) How to avoid giving the
pupil unwarranted clues to the specific responses expected; and (3) How
to arrange the items so as to facilitate scoring. The first two suggestions
below apply to problem one, the next five apply to problem two, and the
last five suggestions are for problem three. In short, a good completion
statement gives a reasonable basis for determining the response desired
without providing unwarranted clues, and is arranged to facilitate scoring.


1. Avoid indefinite statements. The pupil is entitled to know the type of response 
desired, and when this is done the scoring is far more rapid. 


EXAMPLE: Abraham Lincoln was born in ____________.
BETTER: Abraham Lincoln was born in the state of ____________.
The year of Abraham Lincoln’s birth was ____________.


The first statement fails to indicate whether the desired response is the date, the 
place, or the circumstances of his birth. In that form, legitimate answers might 
be “February” or “1809,” for the date; “Kentucky,” or possibly “The South,”
for the place; and “poverty,” or “a log cabin,” for the circumstances of his birth. 
By a slight change in wording the statement is made quite definite. 
2. Avoid overmutilated statements. If too many key words are left out, it is im- 
possible to know what meaning was intended. 
EXAMPLE: The ____________ is obtained by dividing the ____________
by the ____________.
In its present form, it is impossible to tell whether the statement refers to
educational measurement or to arithmetic.


BETTER:

1. The IQ is obtained by dividing the ____________ by the ____________.

2. The ____________ is obtained by dividing the ____________ by the
divisor.


3. Omit key words and phrases, rather than trivial details. If this is not done the 
response may be as obvious as the first example below, or as unnecessarily difficult 
as the second example. 


EXAMPLES: 
1. Abraham Lincoln was born February ______, 1809.
2. Abraham Lincoln was born in ____________ County, Kentucky.


4. Avoid lifting statements directly from the text. This puts too great a premium 
upon rote memory. 


5. Whenever possible, avoid “a” or “an” immediately before a blank. These words
unnecessarily limit the responses that can be used in the blank.




EXAMPLE: Mary picked an ____________ off the tree and ate it.
BETTER: Mary ate the ____________ which she picked off the tree.


It is apparent that the words “pear,” “peach,” “plum,” “cherry,” “lemon,” 
“pineapple,” and the like could not be used in the first statement. In fact, the 
choice tends to narrow down to two familiar fruits, “apple,” or “orange.” The 
second statement contains no specific determiner. 


6. Make the blanks of uniform length. If the blanks vary in length the pupil 
has a clue to the length of answer expected. Even more of a clue is afforded by 
using a dot or a dash for each letter in the correct word. 


EXAMPLE:


1. The second president of the United States was ---- ----- from the
state of -------------.


2. The president in office during the Mexican War was ----- - ---- from
the state of ---------.


BETTER:


1. The second president of the United States was ____________ from the
state of ____________.


2. The president in office during the Mexican War was ____________ from
the state of ____________.


7. Avoid grammatical clues to the answer expected. 
EXAMPLE: The authors of the first performance test of intelligence were
____________.


BETTER: The first performance test of intelligence was prepared by
____________.


8. Choose statements in which there is only one correct response for the blanks. 
The scoring is far more objective if only one specific word or phrase can be used
to complete the statement. 


9. The required response should be a single word or a brief phrase. The more the 
scorer has to read the more time will be required. 


10. Arrange the test so that the answers are in a column at the right of the sentences. 
The illustrations above show various ways in which this may be done. When each 
sentence contains but a single blank, the scoring is made easier if the blank comes 
at the end. The Tests of Everyday Problems in Science and the Gregory Tests in 
American History are examples. If the sentences contain more than one blank, the 
scoring is more rapid if the blanks are numbered and the pupil is directed to write 
his responses in the correspondingly numbered blank in the answer column at
the right. Rinsland²⁶ suggests that the following wording of the directions will be
clear to pupils above the fourth grade, although it may be necessary to explain 
the word “correspondingly” in grades four to seven: 

Directions: In each of the sentences below, one or more words, numbers, or dates
are needed in the numbered blank spaces to make the sentences complete and true. 
Place the word or words in the correspondingly numbered blank to the right. 


11. Prepare a scoring key which contains all acceptable answers. Although it is 
desirable to have only one response which can be considered correct for each blank,


26 Henry Daniel Rinsland, op. cit., page 56.




this is not possible in all cases. As a rule, a satisfactory key can be made by writing 
in red the correct answers on a copy of the test. 

12. Allow one point for each blank correctly filled. Avoid fractional credits and 
unequal weighting of items on the basis of difficulty or importance. 
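
Suggestions 11 and 12 can be sketched in a few lines of Python for the teacher who tabulates responses by machine. This is only an illustration: the function names, the sample key, and the rule of ignoring capitalization and extra spaces are assumptions made here, not part of the suggestions themselves.

```python
def normalize(answer):
    """Lower-case and collapse spaces so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def score_completion(key, responses):
    """Return the number of blanks filled with an acceptable answer.

    key       -- dict: blank number -> set of all acceptable answers
    responses -- dict: blank number -> the answer the pupil wrote
    """
    score = 0
    for blank, acceptable in key.items():
        given = normalize(responses.get(blank, ""))
        if given in {normalize(a) for a in acceptable}:
            score += 1  # one point per blank; no fractional credit
    return score


# Hypothetical key and pupil responses for three blanks
key = {1: {"Kentucky"}, 2: {"1809"}, 3: {"John Adams", "Adams"}}
responses = {1: "kentucky", 2: "1808", 3: "Adams"}
print(score_completion(key, responses))  # 2
```

Listing every acceptable answer in the key, as in blank 3, keeps the scoring objective even when more than one response must be counted correct.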


D. Alternative-Response Tests 


Definition. An alternative-response test is made up of items each of 
which admits of only two possible responses. The usual form is the familiar 
true-false test. Other similar forms are right-wrong, correct-incorrect, yes- 
no, same-opposite, and two-option multiple-choice. 

Advantages and limitations. Obvious advantages of the alternative- 
response test are its apparent ease of construction, applicability to a wide 
range of subject-matter, objectivity of scoring, and wide sampling of 
knowledge tested per unit of working time. The true-false test, a form 
very popular with classroom teachers, has been the object of more research 
and of more criticism than any other form of objective test. The negative- 
suggestion effect and the factor of guessing are often pointed out as limita- 
tions of this type of test. While the use of the correction formula appears 
to make a fairly satisfactory adjustment for guessing in the total score, 
the alternative-response is not well adapted to educational diagnosis. The 
danger of negative suggestion when pupils see statements which are false 
has apparently been overestimated, but perhaps it is wise not to use true- 
false tests as pretests or with young children. In such cases it is better to 
avoid the alternative-response test, or to use a question that can be an- 
swered by yes or no instead of a declarative statement. 
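
The adjustment for guessing mentioned above can be illustrated by a short computation. The sketch assumes the customary form of the correction formula, S = R - W/(n - 1), in which R is the number right, W the number wrong, and n the number of options per item; for alternative-response items n is 2 and the formula reduces to S = R - W. Omitted items are counted neither right nor wrong. The function name and the sample figures are hypothetical.

```python
def corrected_score(rights, wrongs, options=2):
    """Score corrected for guessing: S = R - W / (n - 1)."""
    if options < 2:
        raise ValueError("an item must offer at least two options")
    return rights - wrongs / (options - 1)


# A 75-item true-false test: 60 right, 10 wrong, 5 omitted
print(corrected_score(60, 10))             # 50.0
# A five-option multiple-choice test: 40 right, 12 wrong
print(corrected_score(40, 12, options=5))  # 37.0
```

The penalty per wrong answer is heaviest when there are only two options, which is one reason guessing looms so large in discussions of the true-false test.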

Several modifications and alleged improvements of the true-false test 
have been proposed. Barton,²⁷ for example, has suggested crossing out the
part of the statement that is in error, while other studies²⁸ have shown that
having pupils correct the wrong statements increases the reliability of the
test. Still others²⁹ have proposed that items be weighted according to
the judgment of the pupil, or be marked true, false, doubtful. All of these 
suggestions add somewhat to the labor of scoring and have not received 
wide acceptance. Furthermore, strictly speaking, when these modifications 
are followed, the test is no longer of the alternative-response type. As a rule, 
the most obvious way to “improve” the true-false test is also the best; 

27 W. A. Barton, Jr., “Improving the True-False Examination,” School and Society,
34: 544-546, October 17, 1931.

28 Ernest E. Bayless and Ralph C. Bedell, “A Study of Comparative Validity as
Shown by a Group of Objective Tests,” Journal of Educational Research, 23: 8-16,
January, 1931; F. D. Curtis, W. C. Darling, and M. H. Sherman, “A Study of the
Relative Values of Two Modifications of the True-False Test,” Journal of Educational
Research, 36: 517-527, March, 1943; W. H. E. Wright, “The Modified True-False Item
Applied to Testing in Chemistry,” School Science and Mathematics, 44: 637-639,
October, 1944.

29 Kate Hevner, “A Method for Correcting for Guessing in True-False Tests and
Empirical Evidence in Support of It,” Journal of Social Psychology, 3: 359-362, August,
1932.




that is, make the test longer and prepare it more carefully. At least 75 items 
are desirable, and 50 may be set as an absolute minimum, unless the test 
covers a very narrow range or is used for instructional purposes only. One
advantage of the true-false test is that it can cover more items in the same 
time than any other test type. 

Should pupils be advised to look over true-false tests and change the 
answers on doubtful items? Several studies have attempted to answer this 
question. Hill³⁰ made an extensive investigation of the problem and came
to the conclusion that there is “not much advantage to be gained by 
changing one's answers on a true-false test," although the advantage was 
somewhat greater in changing from true to false than in the reverse. There 
is some evidence that the better pupils profit most from rechecking and 
revising their work. Even if the scores are not always improved, it is prob- 
ably a good work habit to encourage. 

The low esteem in which test experts hold the alternative-response type 
of test, especially the true-false form, is indicated by the infrequency with 
which it has appeared in recent standardized achievement tests. This is due 
chiefly to its weakness as an instrument of diagnosis, and to the fact that 
such tests must be made much longer than other objective tests in order to 
secure comparable reliability. Although this type of test has been greatly 
overworked by classroom teachers, it does have a legitimate, though re- 
stricted, use in informal tests. For example, the true-false test seems well 
adapted to testing the persistence of popular misconceptions and super- 
stitions. Ordinary alternative-test situations are encountered in which it is 
difficult or impossible to make more than two plausible responses for a 
multiple-choice test. There are many troublesome situations of this sort in 
language usage. Common examples include the case forms in pronouns, 
correct use of singular and plural verbs, confusions of past tense and past 
participles, the use of sit and set, lay and lie, and many others. A safe rule 
would be to restrict the use of the alternative-response test to those situations 
to which other test forms are inapplicable, and then to give particular care to 
the wording of the items.
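
The statement that alternative-response tests must be made much longer to secure comparable reliability can be given rough numerical form. The sketch below assumes the Spearman-Brown formula, r_k = kr / (1 + (k - 1)r), which this passage does not name; it is used here only as the customary estimate of the effect of lengthening a test with comparable items. The function names and the reliability figures are hypothetical.

```python
def lengthened_reliability(r, k):
    """Estimated reliability when a test of reliability r is made k times as long."""
    return k * r / (1 + (k - 1) * r)


def lengthening_needed(r, target):
    """Factor by which a test of reliability r must be lengthened to reach target."""
    return target * (1 - r) / (r * (1 - target))


# Hypothetical case: a 50-item true-false test with reliability .60
doubled = lengthened_reliability(0.60, 2)    # estimate for 100 items
factor = lengthening_needed(0.60, 0.90)      # lengthening required to reach .90
print(round(doubled, 2), round(factor, 1))
```

Under these assumed figures, doubling the test raises the estimated reliability to about .75, while reaching .90 would require a test about six times as long, which is hardly practical in the classroom.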


I. Illustrations of Alternative-Response Tests 
California Achievement Tests—Advanced Battery, Form AA³¹
Directions: In the following sentences, mark as you have been told the number
of each correct word.
Test 5—Section C


36. (¹Isn't ²Aren't) the baskets filled with flowers?     36 ________
47. I approve of (¹his ²him) going.                       47 ________


30 George E. Hill, “The Effect of Changed Responses in True-False Tests,” Journal
of Educational Psychology, 28: 308-310, April, 1937.

31 Devised by Ernest W. Tiegs and Willis W. Clark, and published by California
Test Bureau.




For each statement given below that is a complete sentence, mark YES; for each 
that is not, mark NO. 
51. When we approached the deserted farmhouse at night. YES NO 51
56. The mountains resounded with peals of thunder which indi- 
cated the storm's fury. YES NO 56 
Iowa Silent Reading Tests, New Edition, Sentence Meaning, 
Elementary, Form Am³²


Directions: Read each question. If the answer is “Yes,” fill in the space under
YES in the margin. If the answer is “No,” fill in the space under NO. Study the 
sample. Do not guess. 


1. Is a dime less in value than a nickel? ..................... 1 YES NO
2. Can we see things clearly in a thick fog? .................. 2 YES NO
3. Is geography studied in public schools? .................... 3 YES NO


Allport-Vernon-Lindzey “Study of Values”³³


Directions: A number of controversial statements or questions with two alterna-
tive answers are given below. Indicate your personal preferences by writing appro- 
priate figures in the boxes to the right of each question. . . . For each question 
you have three points that you may distribute in any of the following combinations. 
If you agree with alternative (a) and disagree with (b), write [3 in the first box 
and 0 in the second box]. 
If you agree with (b); disagree with (a), write [0 in the first box and 3 in the 
second box]. 
If you have a slight preference for (a) over (b), write [2 in the first box and 1
in the second box].
If you have a slight preference for (b) over (a), write [1 in the first box and 2 in 
the second box]. 


1. The main object of scientific research should be the discovery of truth rather
than its practical applications. (a) Yes; (b) No. 


a b 
10. If you were a university professor and had the necessary ability, would you 
prefer to teach: (a) poetry; (b) chemistry and physics? 
a b 


Tests in English Fundamentals: Grammar³⁴


Directions: Classify the italicized words in the sentences below as adjectives or
adverbs by placing check marks in the proper columns:


32 Devised by H. A. Greene and V. H. Kelley, and published by World Book Company.

33 Devised by Gordon W. Allport, Philip E. Vernon, and Gardner Lindzey, and
published by Houghton Mifflin Company, 1951.

34 Devised by R. Davis, and published by Ginn and Company.




                                              Adjective   Adverb
3. That was a silly remark.               3
6. Those flowers smell sweet.             6
11. You can hardly expect him to wait.   11


The Iowa Every-Pupil Tests in Basic Skills³⁵


Directions: In each of the following sentences there are two or more numbered 
words or phrases inclosed in brackets. If you think the first word or phrase is 
correct, place an X in the first box of the corresponding row on the answer sheet. 
If you think the second answer is correct, place an X in the second box of the 
proper row, etc.
7. Ted is [1. a  2. an] industrious man.
54. My father [1. has  2. hasn't] no money.
62. I want everyone to help [1. himself  2. themselves].


Cooperative Plane Geometry Test, Revised Series Q³⁶
Directions: Read these statements and mark each one in the parentheses at the
right with a plus sign (+) if you think it is always true, or with a zero (0) if you
think it is always or sometimes false.


1. The opposite angles of a parallelogram are equal ................. 1( )
2. A diameter of a circle divides the circle into two equal parts ... 2( )
17. If two triangles are similar, their areas are in the same
ratio as the medians drawn to corresponding sides .................. 17( )
18. All similar polygons are equilateral ........................... 18( )


Tests on Everyday Problems in Science: Unit III³⁷


Directions: There are 25 incomplete statements in this test, each followed by 
parts (a), (b), (c), and (d). One or more of these parts, or perhaps none of them,
correctly complete the incomplete statement. You are to place a plus sign (+) in 
the parentheses (near the right margin) opposite each part which correctly com- 
pletes the statement, and a minus sign (—) opposite each part which does not 
correctly complete the statement. 


35 Devised by H. A. Greene, and published by Extension Division, State University
of Iowa, 1939.

36 Devised by Emma Spanney and L. P. Siceloff, and published by the Cooperative
Test Service.

37 Devised by C. J. Pieper and W. L. Beauchamp, and published by Scott, Foresman
& Company.




13. Minerals in our food supply 
(a) furnish heat and energy to the body ........................... ( )
(b) are the only materials of which cells can be built ............ ( )
(c) are good regulators of certain of the body activities ......... ( )
(d) help particularly to build bone and blood ..................... ( )


Cooperative Solid Geometry Tests³⁸


Directions: Read these statements and mark each one in the parentheses at the 
right with a plus sign (+) if you think it is true, or with a zero (0) if you think 
it is false, wholly or in part. 

4. Any number of planes may be passed through a given straight line ( )
27. Two planes parallel to the same straight line are parallel to each other ( )
41. The square of a diagonal of a cube is three times the square of its edge ( )


George Washington University English Literature Test³⁹
F 1. “Il Penseroso” describes the charms of a merry social life.
F 4. "Pilgrim's Progress" is one of the greatest prose allegories in 


literature. 
F 8. In his poem “The Bells," Poe describes the process of making bells. 




II. Rules and Suggestions for Construction 


The true-false test is often thought to be one of the easiest types to 
prepare. This superiority is more apparent than real, however. Experienced 
test makers are convinced that no test form demands greater skill. Unusual 
care must be exercised in wording true-false statements so that the content 
rather than the form of the statement will determine the response. The 
aim should be to phrase the statement so as not to make its meaning
needlessly obscure on the one hand, nor to provide unwarranted clues on the
other. This balance requires a delicate skill of adjustment that is rare
among makers of informal tests. The following specific suggestions may be 
found helpful in constructing true-false tests. Many of the suggestions for 
constructing multiple-choice tests that are found in the next section are 
also applicable here. 


1. Avoid specific determiners. It has been found that strongly worded statements 
are much more likely to be false than true, while moderately worded statements are 
much more likely to be true than false. Examples of the former are those con- 
taining “all,” “always,” "never," “no,” "none," “nothing,” and the like; examples 
of the latter are those containing “may,” “some,” “sometimes,” “often,” “as a
rule,” and the like. If care is taken to balance the proportion of true and false
items containing any particular expression, that expression ceases to be a specific 
determiner that affords a clue to the answer. 


2. Avoid a disproportionate number of either true or false statements. Since several 
studies have shown that false statements are more valid than true statements, 
the suggestion is sometimes made that the test should have more false statements 

38 Devised by H. T. Lundholm and others, and published by Cooperative Test
Service, 1934.

39 Devised by K. T. Omwake and others, and published by Center for Psychological
Service.




than true. If this were generally done, however, the validity of the false state- 
ments would probably be reduced, since the pupil would then tend to mark all 
doubtful statements false. Tell the students that approximately half of the
statements are true, the other half false.


3. Avoid the exact language of the textbook. Lifting true statements directly 
from the textbook, or making false statements by changing a single word or ex- 
pression puts too great a premium on rote memory. 


4. Avoid trick statements. These are usually statements which appear to be true 
but which are really false because of some inconspicuous word or phrase.


EXAMPLES: 1. “The Raven” was written by Edgar Allen Poe.
2. The battle of Hastings was fought in 1066 B.C.

BETTER: 1. “The Raven” was written by Edgar Allan Poe.
2. The battle of Hastings was fought in 55 B.C.


Also avoid “double-headed” statements like the following one (especially if they 
are partly true and partly false): Poe wrote “The Gold Bug” and “The Scarlet 
Letter.” 


5. Avoid double negatives. Such statements are especially bad, since pupils well 
versed in English grammar might conclude that two negatives equal an affirmative, 
while other pupils would interpret such statements as emphatic negatives. 


6. Avoid ambiguous statements. With one interpretation the statement may be 
true and with another equally plausible interpretation it may be false. It is im- 
possible to tell what is being measured when a statement has more than one 
legitimate interpretation. 


7. Avoid unfamiliar, figurative, or literary language. The experience of the learner 
must be considered. A statement is badly worded when a pupil who understands 
the point involved misses it because of the language employed. 


8. Avoid long statements, especially those involving complex sentence structure. 
Same reason as for the preceding suggestion. 


9. Avoid qualitative language wherever possible. Quantitative language conveys 
more exactly the meaning intended. Expressions such as “few,” “many,” “large,” 
“small,” “old,” “young,” “important,” “unimportant,” are vague and indefinite. 


10. Require the simplest possible method of indicating the response. Instead of 
requiring the pupil to write True and False or Yes and No, let him write T and F,
Y and N, or underline the correct response. The symbols “+” for true and “0” 
for false are so distinct as to make scoring still easier. When the pupil must choose 
between two words or expressions, the responses should be numbered so that they 
can be indicated by writing the correct number. 

11. Indicate by a short line or by ( ) where the response is to be recorded. The 
responses may be arranged in a column at either the left or right of the statements.
Most scorers prefer the answers at the right. 

12. Arrange the statements in groups. There is some advantage in scoring if the 
items are arranged in groups of five, with double spacing between each group. 
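
Suggestions 1 and 2 lend themselves to a mechanical check of a draft test before it is duplicated. The following Python sketch is offered only as an illustration: the word lists are taken from suggestion 1, the sample statements are borrowed from the geometry illustrations above with answers keyed here for the purpose, and the function name is hypothetical.

```python
import re

STRONG = {"all", "always", "never", "no", "none", "nothing"}
MODERATE = {"may", "some", "sometimes", "often"}


def audit_true_false(items):
    """Tabulate a draft true-false test.

    items -- list of (statement, keyed_true) pairs; keyed_true is a bool.
    Returns the share of items keyed true and, for each class of
    determiner, how many items containing one are keyed true and false.
    """
    counts = {"strong": {True: 0, False: 0}, "moderate": {True: 0, False: 0}}
    n_true = 0
    for statement, keyed_true in items:
        words = set(re.findall(r"[a-z]+", statement.lower()))
        if keyed_true:
            n_true += 1
        if words & STRONG:
            counts["strong"][keyed_true] += 1
        if words & MODERATE:
            counts["moderate"][keyed_true] += 1
    return n_true / len(items), counts


draft = [
    ("All similar polygons are equilateral.", False),
    ("Two planes parallel to the same straight line are parallel to each other.", False),
    ("Any number of planes may be passed through a given straight line.", True),
    ("The opposite angles of a parallelogram are equal.", True),
]
share_true, counts = audit_true_false(draft)
print(share_true)   # 0.5
print(counts)
```

In this small draft half of the items are keyed true, but the only strongly worded item is false and the only moderately worded item is true, so both expressions would still serve as clues until balancing items are added.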


E. Multiple-Choice Tests 


Definition. A multiple-choice test is made up of items each of which 
presents two or more responses, only one of which is correct or definitely 




better than the others.⁴⁰ Each item may be in the form of a direct question,
an incomplete statement, or a word or phrase. This form of test is to be 
distinguished from the multiple-response type, which requires that two or 
more responses be made to a single item. 

Possibilities and limitations. The multiple-choice type of item is 
usually regarded as the most valuable and most generally applicable of all 
test forms. Lee regards it as “one of the best means for testing judgment
that is available.”⁴¹ Lindquist asserts that it is “definitely superior to other
types” for measuring such educational objectives as “inferential reasoning,
reasoned understanding, or sound judgment and discrimination on the part
of the pupil.”⁴² Cronbach⁴³ regards it as being practically free from “response
sets,” the tendency for examinees to select a given option position
more often than would be predicted on the basis of chance alone. 

One study⁴⁴ suggests fourteen types of questions which may be asked
in multiple-choice test items. The list is not all-inclusive and does not 
intend to prescribe the exact language to be used but serves as a guide 
in formulating the questions. 


1. Definition
a. What means the same as . . . ?
b. What conclusion can be drawn from . . . ?
c. Which of the following statements expresses this concept in different form?
2. Purpose
a. What purpose is served by . . . ?
b. What principle is exemplified by . . . ?
c. Why is this done?
d. What is the most important reason for . . . ?
3. Cause
a. What is the cause of . . . ?
b. Under which of the following conditions is this true?
4. Effect
a. What is the effect of . . . ?
b. If this is done, what will happen?
c. Which of the following should be done (to achieve a given purpose)?
5. Association
What tends to occur in connection (temporal, causal, or concomitant
association) with . . . ?


40 It is also possible, especially in English usage and spelling tests, to have several
correct forms and only one incorrect or least desirable form, which is to be chosen in
each item.

41 J. Murray Lee, A Guide to Measurement in Secondary Schools, page 379. New York:
D. Appleton-Century Company, 1936.

42 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, op. cit., page 138.

43 Lee J. Cronbach, “Further Evidence on Response Sets and Test Design,”
Educational and Psychological Measurement, 10: 3-31, Spring, 1950.

44 Charles I. Mosier, M. Claire Myers, and Helen G. Price, “Suggestions for the
Construction of Multiple-Choice Test Items,” Educational and Psychological
Measurement, 5: 261-271, Autumn, 1945.




6. Recognition of Error
Which of the following constitutes an error (with respect to a given
situation)?
7. Identification of Error
a. What kind of error is this?
b. What is the name of this error?
c. What recognized principle is violated?
8. Evaluation
What is the best evaluation of . . . (for a given purpose) and for what
reason?
9. Difference
What is the important difference between . . . ?
10. Similarity
What is the important similarity between . . . ?
11. Arrangement
In the proper order (to achieve a given purpose or to follow a given rule),
which of the following comes first (or last, or follows a given item)?
12. Incomplete Arrangement
In the proper order, which of the following should be inserted here to
complete the series?
13. Common Principle
All of the following items except one are related by a common principle:
a. What is the principle?
b. Which item does not belong?
c. Which of the following items should be substituted?
14. Controversial Subjects
Although not everyone agrees on the desirability of ________, those
who support its desirability do so primarily for the reason that ________.


Unusual care must be exercised in the construction of multiple-choice 
items in order to avoid the inclusion of irrelevant or superficial clues, and 
to insure that the tests measure something more than the memory of factual 
knowledge. The value of multiple-choice tests in diagnosis depends upon 
the skillful selection of the incorrect choices presented in the items. 
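
One simple way to judge whether the incorrect choices are doing their work is to tally how many pupils select each of them. The Python sketch below is an illustration only; the function name and the sample responses are hypothetical, and the tally is a common clerical procedure rather than one prescribed in this section.

```python
from collections import Counter


def option_tally(responses, correct):
    """Count the pupils choosing each option of a single item.

    responses -- list of option numbers chosen (None for an omitted item)
    correct   -- the keyed option number
    Returns the number choosing the keyed option and a dict giving the
    number choosing each incorrect option that was selected at all.
    """
    tally = Counter(r for r in responses if r is not None)
    wrong = {opt: n for opt, n in tally.items() if opt != correct}
    return tally[correct], wrong


# Ten hypothetical pupils answering a five-option item keyed 4
right, wrong = option_tally([4, 4, 1, 4, 2, 1, None, 4, 1, 4], correct=4)
print(right, wrong)  # 5 {1: 3, 2: 1}
```

An incorrect choice that no pupil selects, such as options 3 and 5 here, contributes nothing to diagnosis and is a candidate for revision; one that attracts many pupils points to a specific misunderstanding worth reteaching.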


I. Illustrations of Multiple-Choice Tests 


The items below, taken from standard tests, illustrate several different
arrangements of multiple-choice tests in a variety of subjects.⁴⁶ This type
of test is widely used in all school subjects, on all educational levels, and for
measuring a variety of teaching objectives.

Special attention should perhaps be called to two of the illustrations,
both of which are suggestive to teachers in making informal tests. The
Nelson High School English Test illustrates the possibility of testing
punctuation with a minimum of scoring labor. The Cooperative Test of Social
Studies Abilities shows how objective tests may be used to test more than
the memory for factual knowledge. This is a good example of a test of the
pupil's ability to interpret facts—an ability which is an important aspect
of thinking.

45 Ellis Weitzman and Walter J. McNamara, “Apt Use of the Inept Choice in
Multiple-Choice Testing,” Journal of Educational Research, 39: 517-522, March, 1946.

46 These tests are not all equally good, however. The reader will note that some of
them are not wholly consistent with the principles set forth in this chapter.


Kuhlmann-Finch Intelligence Tests, Test IV⁴⁷
13. Early is to begin as late is to
  1       2      3       4       5
start    end   awake   enter   prompt        13 ________
22. Flour is to bread as sugar is to
  1       2      3       4       5
sweet   candy  fruit   cook     eat          22 ________

The Modern School Achievement Tests, Language Usage⁴⁸


Directions: In each sentence, choose the word or group of words that make the 
best sentence. Then on the dotted line at the right, copy the number that is before 
the correct form. 


                          1. off
4. I borrowed a pen       2. . . .          ....................
                          3. from
                          1. your
7. Every student must do  2. his            ....................
                          3. their
                          1. has got
17. He                    2. has            ....................
                          3. has gotten


The Barrett-Ryan Literature Test: Silas Marner⁴⁹


A. ( ) An episode that advances the plot is the—1. murdering of a man.
2. kidnapping of a child. 3. stealing of money. 4. fighting of a duel.

B. ( ) Dolly Winthrop is—1. an ambitious society woman. 2. a frivolous girl. 
3. a haughty lady. 4. a kind, helpful neighbor. 

C. ( ) A chief characteristic of the novel is—1. humorous passages. 2. portrayal 
of character. 3. historical facts. 4. fairy element. 


Wesley Test in Political Terms⁵⁰


1. An embargo is
1. a law or regulation 2. a kind of boat 3. an explorer
4. a foolish adventure 5. an embankment ( )


2. An injunction is a
1. part of speech 2. wreck 3. union of two things
4. court order 5. form of advice ( )


47 Devised by F. H. Finch, E. Kuhlmann, and G. L. Betts, and published by
Educational Test Bureau.

48 Devised by A. I. Gates and others, and published by Bureau of Publications,
Teachers College, Columbia University.

49 Devised by E. R. Barrett, T. M. Ryan, and H. E. Schrammel, and published by
Kansas State Teachers College, Emporia.

50 Devised by E. B. Wesley, and published by Charles Scribner’s Sons.




Unit Scales of Attainment in Foods and Household Management⁵¹


2. The spoon should be placed
1. at the top of the plate
2. at the left of the fork
3. in the spoon holder on the table
4. at the right of the knife ( )


40. We get the most calories per pound from
1. proteins 2. carbohydrates
3. fats 4. mineral matter
5. vitamins .................... ( )


Traxler Silent Reading Test, Word Meaning⁵²


8. The commendation is deserved.

(1) success (2) blow (3) popularity (4) good fortune

(5) praise ( )
9. His actions received condemnation.

(1) approval (2) applause (3) censure (4) sympathy

(5) contempt ( )


Cooperative French Test, Junior Form 1936 


2. Quand on vous pose une question, il faut
1 répondre, 2 se taire, 3 se sauver, 4 tourner le dos, 5 baisser la
tête ............................................................. ( )
7. Cette dame est ma grand’mère; je suis
1 son fils, 2 son neveu, 3 son frère, 4 son cousin, 5 son petit-fils .. ( )


16. J’ai deux frères, Jean et Paul. Jean a sept ans, Paul en a treize et moi
j’ai douze ans. Qui est le plus jeune?
1 Jean, 2 Paul, 3 moi, 4 dix ans, 5 les deux frères ............. ( )


Nelson High School English Test⁵⁴


Directions: Some of the sentences contain errors in punctuation; some of them are 
correct. If you think some mark is not needed, cross out the letter indicating that 
mark under the word “Omit.” If you think some additional mark is needed, cross 
out the letter indicating that mark under the word “Add.” If you think the exercise 
is correct, cross out the letter r. Key: a—apostrophe; c—comma; d—dash ; e—excla- 
mation point; h—hyphen; p—period; q—quotation mark; s—semicolon. 

                                          Add      Omit      Right

 1. You must elect a chairman, three
    judges and an official timekeeper.
 6. He said “that either you or I must
    go.”
 8. The car which John is driving is a
    new one.
14. “Well, I think highly of them Mary”
    I said.


51 Devised by Ethel B. Reeve and Clara M. Brown, and published by Educational
Test Bureau, Inc.

52 Devised by Arthur E. Traxler, and published by Public School Publishing Company.

53 Devised by Jacob Greenberg and Geraldine Spaulding, and published by
Cooperative Test Service.

54 Devised by M. J. Nelson, and published by Houghton Mifflin Company.




Cooperative Test of Social Studies Abilities, Experimental Form Q⁵⁵
INTERPRETING FACTS 


Directions: The exercises in this part consist of a series of paragraphs each
followed by several statements about the paragraph. In the parentheses after each
statement, put a

1, if the statement is a reasonable interpretation, fully supported by the facts
given in the paragraph;
2, if the statement goes beyond and cannot be proved by the facts given in the
paragraph;
3, if the statement contradicts the facts given in the paragraph.
[The sample exercise and the explanation are omitted.] 


I. The nineteenth century witnessed a rapid growth in Germany's industrial 
power. Like England, Germany came to have a fairly satisfactory balance between 
the amount of its export and import trade. Heavy exports of coke supplied full
cargoes for ships to foreign ports and helped to balance heavy importations of raw 
materials. The imports especially provided a means for distributing freight rates 
to the advantage of the German trader competing overseas. By these means 
Germany was constantly obtaining larger portions of world trade. German wares 
were carried into every trading realm, and trade meant political as well as com- 
mercial power in foreign lands. 


1. Through growth in foreign trade, Germany's industrial power
   increased in the nineteenth century ........................... 1( )
2. Germany had an export trade equal in volume to that of England 2( )
3. Germany exported very little coke to foreign countries ......... 3( )
4. England was unable to balance the tonnage of her import and
   export trade .................................................. 4( )
5. By reducing freight rates Germany was constantly gaining a greater
   percentage of world trade ..................................... 5( )
6. The sale of German wares in every part of the world resulted in
   added political influence and commercial growth ............... 6( )


Several kinds of multiple-choice items, including analogies, are illustrated in 
Appendix A, pages 429-435.


II. Rules and Suggestions for Construction 


The purpose of suggestions 1 to 5 below is to avoid unwarranted clues 
to the desired response, the purpose of suggestions 6 to 9 is to encourage 
responses on a high intellectual level, and the purpose of suggestions 10 to 
14 is to make the scoring as simple and rapid as possible. 


1. Make all optional responses grammatically consistent. For example, if the
verb is singular, avoid plural responses, and vice versa. Avoid using “a” or “an”
as the word in an incomplete statement immediately preceding the list of responses,
unless all options begin with a consonant sound (in the case of “a”) or all begin with
a vowel sound (in the case of “an”).

2. As a rule, use direct questions rather than incomplete statements. The question
form is more natural and less likely to contain irrelevant clues.

3. Avoid making the correct response consistently longer or shorter than the others.


55 Devised by J. Wayne Wrightstone, and published by Cooperative Test Service. 




4. Avoid using in the correct response the same words or phrases that occur in the 
question or incomplete statement. 

5. Arrange the responses so that the correct one occurs in random order. The pupils
are likely to detect any regularly recurring pattern in the sequence of responses.

6. Make all responses plausible. In phrasing multiple-choice test items,
consideration should be given to the fact that the answer may be arrived at by
eliminating the incorrect responses as well as by selecting the correct response directly.
The aim should be to make each suggested response so plausible as to tempt pupils 
who have only superficial knowledge of the point involved. The plausibility of in- 
correct responses may be increased by using familiar, stereotyped, or textbook 
phraseology, or expressions very similar to those in the question or incomplete 
statement. 

7. At least four choices should be presented whenever possible. Increasing the 
number of plausible choices tends to reduce the guessing factor. Horst⁵⁶ found,
however, that when the incorrect responses are of equal difficulty the chance ele- 
ment is less than when the choice is among a greater number of responses with a 
wider range of difficulty. 

8. In testing for the understanding of a term or concept, the term should usually 
be presented first, followed by a series of definitions or descriptions from which the 
choice is to be made. If the order is reversed, so that from a series of terms the choice 
is made of the one that best fits the definition or descriptive statement, the selection 
frequently can be made based upon superficial verbal associations and not upon 
genuine understanding. 

9. To measure the higher levels of understanding, increase the homogeneity of the 
options provided. The following illustration from Lindquist⁵⁷ shows how the degree
of required discrimination increases with the homogeneity of the responses pre- 
sented: 

A. Engel's law deals with
   1. the coinage of money
   2. the inevitableness of socialism
   3. diminishing returns
   4. marginal utility
   5. family expenditures

B. Engel's law deals with family expenditures for
   1. luxuries
   2. food
   3. clothing
   4. rent
   5. necessaries

C. According to Engel's law, family expenditures for food
   1. increase in accordance with the size of the family
   2. decrease as income increases
   3. require a smaller percentage of an increasing income
   4. rise in proportion to income
   5. vary with the tastes of families

To respond correctly to A, all that is required is the knowledge that Engel's law
deals with family expenditures. In B a knowledge of the specific item of expenditure
is necessary. The maximum degree of discrimination, however, is required in C,
where still more information is given.


56 Paul Horst, “The Difficulty of a Multiple-Choice Test Item,” Journal of
Educational Psychology, 24: 229-232, March 1933.

57 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, op. cit., pages 146-147.




10. Require the simplest possible method of indicating a response. This usually 
means that the responses are lettered and the choice is made by indicating the 
letter of the response. In the first two or three grades where key letters may 
not be understood, it will be better to permit the more natural response of
underlining the correct answer.

11. Indicate by a short line or by ( ) where the response is to be recorded or, better
still, use a separate answer sheet.

12. Arrange the items in groups. As a rule, groups of five will be suitable, although 
other numbers of items may sometimes be better. Double space between each 
group. 

13. Use the “correction for chance" formula (page 156) if the number of choices is 
fewer than six. If there are six or more responses suggested for each item, the gain 
in validity is seldom sufficient to warrant the labor of making corrections for chance.

14. Group together all items with the same number of choices. This is especially 
desirable when the correction formula is to be used. 
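
Suggestions 7 and 13 both turn on the guessing factor. As a minimal sketch, assuming the formula on page 156 is the usual correction S = R - W/(n - 1), where R is the number right, W the number wrong, and n the number of choices per item, the following Python function shows how the adjustment shrinks as the number of choices grows. The function name and the sample figures are illustrative only.

```python
from fractions import Fraction


def corrected_score(rights: int, wrongs: int, choices: int) -> Fraction:
    """Correction for chance: S = R - W / (n - 1).

    Assumes the standard formula; omitted items count neither as
    right nor as wrong.
    """
    if choices < 2:
        raise ValueError("an item needs at least two choices")
    if rights < 0 or wrongs < 0:
        raise ValueError("counts cannot be negative")
    return Fraction(rights) - Fraction(wrongs, choices - 1)


if __name__ == "__main__":
    # Hypothetical pupil: 30 right, 12 wrong, 8 omitted on a 50-item test.
    for n in (2, 4, 5, 6):
        print(n, "choices:", float(corrected_score(30, 12, n)))
```

For this hypothetical pupil the correction removes 12 points on two-choice items but only 2.4 points on six-choice items, which is consistent with the remark in suggestion 13 that the correction is seldom worth the labor when six or more responses are offered.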


F. Matching Tests 


Definition. A matching test typically consists of two columns, each item 
in the first column to be paired with a word or phrase in the second column 
upon some basis suggested. In the simplest form of matching test the num- 
ber of responses is exactly the same as the number of items. Frequently, 
matching tests are made which provide more responses than are required.
Sometimes the items in the first column are incomplete sentences, each of 
which requires a word or phrase from the second column for its completion. 
Occasionally two, or even more, columns of responses are given, from each 
of which a choice must be made for each item in the first column. The 
matching test is also useful for identifying numbered places or parts on 
maps, charts, and diagrams. 

Advantages and limitations. There are many types of learning which 
involve the association of two things in the mind of the learner. Common 
examples are the following: Events and dates, events and persons, events 
and places, terms and definitions, foreign words and English equivalents, 
laws and illustrations, rules and examples, tools and their use, and the like. 
The matching test is a very convenient form of exercise for measuring such 
learning. In the words of Lindquist, “The matching exercise is particularly 
well adapted to testing in who, what, when, and where types of situations, 
or for naming and identifying abilities.”⁵⁸

Its principal limitations are as follows: (1) it is not well adapted to the
measurement of understanding as distinguished from mere memory;
(2) with the exception of the true-false test, the matching test is the form
most likely to include irrelevant clues to the correct response; and (3) unless
skillfully made, it is time-consuming for the pupil. The suggestions
that follow are designed to overcome the last two limitations. The matching 
test can hardly be designed to measure genuine understanding of a high 
level or the ability to interpret complex relationships. 


58 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, op. cit., page 150.




I. Illustrations of Matching Tests

The following examples from standard tests illustrate different mechanical
arrangements of matching tests in a variety of subjects.


Every Pupil Test in Physics⁵⁹


Directions: Read each definition or description. Then select from the Answer List
the word thus defined and write its number on the dotted line in front of the
definition. The answer to the sample is (Power), so 18 is written on the dotted line.


Answer List: (Arranged alphabetically)

 1. Adhesion        10. Energy                 17. Potential
 2. Centrifugal     11. Heat of Fusion         18. Power
 3. Centripetal     12. Heat of Vaporization   19. Radiation
 4. Cohesion        13. Inertia                20. Relative Humidity
 5. Conduction      14. Insulator              21. Specific Gravity
 6. Conductor       15. Kinetic                22. Specific Heat
 7. Convection      16. Mechanical Advantage   23. Surface Tension
 8. Density                                    24. Work
 9. Efficiency


..18.. Sample: The rate of doing work.

......  1. Weight per unit volume.
......  2. Mutual force of attraction between like molecules.
......  3. Tendency of a body to resist any change in its state of rest or motion.
......  4. Tendency of surface of a liquid to contract as much as possible.
......  5. Capacity for doing work.
......  6. The ratio of resistance overcome to effort exerted.
......  7. The product of a force and the distance through which it acts.
......  8. Ratio of output to input.
......  9. The energy a body possesses because of its position.
...... 10. The number of calories required to melt one gram of a substance.
...... 11. Amount of water-vapor the air holds compared to what it could hold at
           the same temperature.
...... 12. Transfer of heat from a hot to a cold body by molecular collision.
...... 13. Transfer of heat by means of ether waves.
...... 14. The force pulling the body toward the centre of rotation.
...... 15. A substance that conducts heat or electricity very poorly or almost not
           at all.


Cooperative Test of Social Studies Abilities, Experimental Form Q⁶⁰


Directions: In which of the sources listed in the left-hand column would you 
look first to find the items listed in the right-hand column? Consider each group 
separately. Put the number of the best source in the parentheses after each item. 


1 Atlas                   51. A discussion of an important present-day
2 Current History             issue in Congress ..................... 51( )
3 Dictionary              52. The location of the ten largest cities in
4 Economics textbook          the world ............................. 52( )
5 Encyclopedia            53. How to hyphenate the word cinema ...... 53( )
                          54. Amendments to the Constitution ........ 54( )
                          55. A discussion of standards of living ... 55( )
                          56. The population of a particular small
                              U. S. ................................. 56( )

1 American history        57. List of news dispatches on CCC
  textbook                    activities ............................ 57( )
2 Book of quotations      58. A short account of the early history of
3 Library catalog             Manhattan Island ...................... 58( )
4 National Geographic     59. The author of Weather in the Street ... 59( )
  Magazine                60. Information about the growth of slavery
5 New York Times              in the United States .................. 60( )
  Index                   61. Who said, “Brevity is the soul of wit.” 61( )
                          62. Pictures and story of recent developments
                              in the TVA ............................ 62( )

1 Daily newspaper         63. The Pulitzer Prize awards of 1930 ..... 63( )
2 Readers’ Guide to       64. Today’s price quotations on stocks and
  Periodical Literature       bonds ................................. 64( )
3 Time
4 World Almanac
5 Library catalog


59 Devised by E. W. Brown and others, and published by the Ohio State Department
of Education, 1930.

60 Devised by J. Wayne Wrightstone, and published by Cooperative Test Service.


Cooperative French Test, Junior Form⁶¹


Directions: Each of the English sentences and phrases below is followed by a 
translation in which there is a blank indicated in this way: (___). The translation 
will be correct when one of the five numbered words, phrases, or endings listed at 
the left of the group is inserted in the blank (___). Decide which of the five items
will make the translation complete and correct, and put its number in the
parentheses at the right-hand edge of the page.


IV
1. ce         29. These books.                  (___) ........................... ( )
2. ces
3. cet        30. That school.                  (___) ........................... ( )
4. celles
5. cette      31. That money.                   (___) ........................... ( )

VIII
1. qui        38. What are they asking for?     (___) demandent-ils? ............ ( )
2. quoi
3. quelles    39. Who came down the first?      (___) est descendu le premier? .. ( )
4. que
5. qu’        40. Which roads are the best?     (___) routes sont les meilleures? ( )

XIII
1. -e         50. They lighted several fires.   On a allumé plusieurs feu-(___) . ( )
2. -es
3. -x         51. I didn't buy the other books. Je n'ai pas acheté les autr-(___) ... ( )
4. -s
5. No ending  52. He had black hair.            Il avait les cheveu-(___) ... ( )
   needed


61 Devised by Jacob Greenberg and Geraldine Spaulding, and published by
Cooperative Test Service, 1936.


Sones-Harry High School Achievement Test⁶²


SECTION G. [MATHEMATICS] 
IMPORTANT THEOREMS IN GEOMETRY 


Directions: In the parentheses after each geometric condition given below in 
Column 2, write the number of the results in Column 1 that could be proved by it. 


COLUMN 1 (RESULTS)
 1. angles equal
 2. triangles congruent
 3. triangles similar
 4. lines perpendicular
 5. lines parallel
 6. quadrilateral is a parallelogram
 7. parallelogram is a rectangle
 8. two arcs equal (in same or equal circles)
 9. two chords equal (in same or equal circles)
10. areas of polygons equivalent

COLUMN 2 (CONDITIONS)
66. If two opposite sides are equal and parallel ..................... ( )66
67. If perpendicular to the same line ................................ ( )67
68. If the sides are proportional .................................... ( )68
69. If they have equal arcs .......................................... ( )69
70. If side-angle-side equal side-angle-side respectively ............ ( )70
71. If they are parallelograms with equal bases and altitudes ........ ( )71
72. If their central angles are equal ................................ ( )72
73. If a tangent is drawn to the radius at point of contact .......... ( )73
74. If corresponding parts of congruent triangles .................... ( )74
75. If one angle is a right angle .................................... ( )75


II. Rules and Suggestions for Construction


The purpose of the first three suggestions is to avoid irrelevant clues and that 
of the remaining five is to reduce the amount of time required to take the test. 


1. Include only homogeneous material in each matching exercise. Do not mix, in a 
single test, such dissimilar associations as persons and events, dates and events, 
terms and definitions. Put short titles at the top of both columns to describe the 
contents accurately. For example: Column 1, Events; Column 2, Dates. 

2. Check each exercise carefully for unwarranted clues that may indicate matching 
pairs. For each item ask yourself this question: “What is the least amount of 
information that must be known in order to select the right response?" 

3. Avoid making the test too easy. The difficulty of a matching exercise may be 
increased by including more responses than needed, and by using some of the 
responses more than once in the same test.

4. One list should consist of single words, numbers, or brief phrases. In general, 
the column of short terms should contain the items from which the choice is made. 

5. The items in the response column should be arranged in systematic order. If the
list consists of dates, they should be in chronological order. For other items, 
alphabetical order will assist the pupils in locating the desired response. The re- 
sponses in the column should then be numbered consecutively. 


62 Devised by W. W. D. Sones and David P. Harry, Jr., and published by World
Book Company.




6. Indicate clearly the basis upon which matching is to be done. This should be 
specified both in the directions and in the column headings. The pupil will be told 
to put the NUMBER of the response selected in the answer space beside the test 
item.

7. The matching exercise should contain at least five and not more than fifteen items. 
Larger lists waste time and shorter lists increase the possibility of guessing the 
correct response. 

8. All the items for the matching exercise should be on a single page. Turning the 
page back and forth in search of desired responses is both confusing and time- 
consuming. 
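
Several of the suggestions above are mechanical enough to check automatically. The sketch below is one hypothetical way to do so in Python: the function name, the warning messages, and the sample exercise are illustrative, and only suggestions 3, 5, and 7 are covered.

```python
def check_matching_exercise(items, responses):
    """Return a list of warnings for a draft matching exercise.

    items     -- list of (premise, correct_response) pairs
    responses -- the response column, in the order it will be printed
    """
    warnings = []

    # Suggestion 7: at least five and not more than fifteen items.
    if not 5 <= len(items) <= 15:
        warnings.append("use between five and fifteen items")

    # Suggestion 5: responses in systematic (here, sorted) order.
    if list(responses) != sorted(responses):
        warnings.append("arrange the responses in systematic order")

    # Every keyed answer must actually appear in the response column.
    keyed = [answer for _, answer in items]
    missing = set(keyed) - set(responses)
    if missing:
        warnings.append("keyed answers missing from responses: "
                        + ", ".join(sorted(map(str, missing))))

    # Suggestion 3: extra or reused responses prevent matching by elimination.
    has_extra = len(set(responses)) > len(set(keyed))
    has_reuse = len(keyed) > len(set(keyed))
    if not (has_extra or has_reuse):
        warnings.append("add extra responses or reuse some responses")

    return warnings


if __name__ == "__main__":
    # Hypothetical exercise: Column 1, Events; Column 2, Dates.
    events = [
        ("Declaration of Independence signed", 1776),
        ("Constitution drafted", 1787),
        ("Louisiana Purchase", 1803),
        ("War of 1812 begins", 1812),
        ("Missouri Compromise", 1820),
    ]
    dates = [1776, 1787, 1789, 1803, 1812, 1820, 1823]
    print(check_matching_exercise(events, dates))
```

Sorting the dates numerically gives the chronological order that suggestion 5 calls for; for a column of terms, the same comparison checks alphabetical order. Homogeneity (suggestion 1) and unwarranted clues (suggestion 2) still require the teacher's judgment.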


G. Rearrangement Tests 


Appendix C, pages 452-455, contains remarks about the preparation and 
scoring of rearrangement (ranking, sequence, chronology, continuity) items 
and should probably be consulted at this point. Table 51 on page 455 makes 
scoring them rather simple, thereby eliminating one of the chief objections 
to this item type. 

Part II of the Allport-Vernon-Lindzey “Study of Values”⁶³ consists of
15 ranking items, the examinee being asked to “Arrange these answers in 
the order of your personal preference by writing, in the appropriate box 
at the right, a score of 4, 3, 2, or 1. To the statement you prefer most 
give 4, to the statement that is second most attractive 3, and so on." Two 
of these items are: 


2. In your opinion, can a man who works in business all the week best spend 
Sunday in— 


a. trying to educate himself by reading serious books [ ]
b. trying to win at golf, or racing [ ]
c. going to an orchestral concert [ ]
d. hearing a really good sermon [ ]
13. To what extent do the following famous persons interest you— 
a. Florence Nightingale [ ]
b. Napoleon [ ]
c. Henry Ford [ ]
d. Galileo [ ]


63 Devised by Gordon W. Allport, Philip E. Vernon, and Gardner Lindzey, and
published by Houghton Mifflin Company, 1951.


SELECTED REFERENCES FOR FURTHER READING 


Adkins, Dorothy C., and others, Construction and Analysis of Achievement Tests. 
Washington, D. C.: U. S. Government Printing Office, 1947. 292 pages.

Buros, Oscar K. (Editor), The Fourth Mental Measurements Yearbook. Highland 
Park, New Jersey: Gryphon Press, 1953. 1163 pages.

Ebel, Robert L., “Writing the Test Item,” Chapter 7 in E. F. Lindquist (Editor), 
Educational Measurement. Washington, D. C.: American Council on Education, 
1951. 

Ezell, L. B., “A Device for Scoring Chronology Tests,” Social Education, 13:
329-331, November, 1949. 

Goodenough, Florence L., Mental Testing. New York: Rinehart & Company, 1949. 
Chapter 9, “The Analysis and Selection of Test Items.” 

Henry, Nelson B. (Editor), “The Measurement of Understanding,” Forty-Fifth 
Yearbook of the National Society for the Study of Education, Part I. Chicago: 
University of Chicago Press, 1946. 338 pages. 

Jordan, A. M., Measurement in Education. New York: McGraw-Hill Book Com- 
pany, 1953. Chapter 3, “Constructing Achievement Tests.” 

Odell, C. W., How to Improve Classroom Testing. Dubuque, Iowa: Wm. C. Brown 
Company, 1953. 156 pages. 

Stephenson, William, Testing School Children. New York: Longmans, Green and 
Company, 1949. Chapter VI, “Tests of Creative Imagination,” and Chapter VII, 
“Performance Tests." 

Travers, Robert M. W., How to Make Achievement Tests. New York: Odyssey Press, 
1950. 180 pages. 

Traxler, Arthur E.; Jacobs, Robert; Selover, Margaret; and Townsend, Agatha, 
Introduction to Testing and the Use of Test Results in Public Schools. New York:
Harper & Brothers, 1953. Chapter 4, “How Can Tests Be Selected?” 

Weitzman, Ellis, and McNamara, Walter J., Constructing Classroom Examinations 
—a Guide for Teachers. Chicago: Science Research Associates, 1949. 153 pages. 


The Construction and Use of Essay
Examinations


Stalnaker¹ compares the merits of essay and objective tests in a thorough
and impartial manner, concluding that both have considerable value when 
properly used. The summary to his chapter on page 530 is especially 
interesting:


The essay test has been the subject of repeated and often unfair attacks by 
psychologists and educationalists interested in the measurement of achievement as 
a science. As a result, the essay test remains largely undeveloped, although it
continues to be used widely by the classroom teacher. The values claimed for it 
have not been generally established, yet it may well be a basic test form which, 
properly controlled, can measure important outcomes of learning not yet otherwise 
measured. It also has other potential values which have been described. It has
several important and unique advantages as an educative influence. The fact that
it continues to be a test form widely used by the teacher preparing his own test
would alone seem to justify further development and research.


For many years the College Entrance Examination Board² has been
concerned with the improvement of essay tests, particularly the increasing of
scoring agreement on English compositions. Its journal, The College Board
Review, published three times yearly, and the Annual Report of the Director³
contain valuable reports of work with essay tests.

To limit the use of informal teacher-made tests to those classified as
objective in type is an unwarranted restriction. The so-called traditional
test or essay examination still has a legitimate place in the modern school.
This chapter will consider some of the advantages and limitations of this
type of test, and offer suggestions for its improvement and use.

1 John M. Stalnaker, “The Essay Type of Examination,” Chapter 13 in E. F.
Lindquist (Editor), Educational Measurement. Washington, D. C.: American Council on
Education, 1951.

2 Abbreviated CEEB and located at 425 West 117th Street, New York 27, N. Y.

3 The 53rd annual report covers the period October 1, 1952-September 30, 1953.


A. Limitations of the Essay Examination 


As ordinarily employed, the essay examination has certain serious limita- 
tions. It suffers in comparison with most forms of objective tests on the 
three important criteria of a satisfactory measuring instrument: validity,
reliability, and usability. 

Low validity. In the first place, the essay examination as commonly 
used has low validity. Several factors contribute to this condition. The 
limited sampling of the essay examination is often pointed out. Ruch,⁴ for
example, produced evidence to show that the essay called forth less than
half the knowledge the average pupil actually possessed on the subject as 
determined by objective tests, and required twice the time to do it. The 
essay also includes many irrelevant factors, such as the quality of the spell- 
ing, handwriting, and English used, as well as bluffing, for which no cor- 
rection formulas exist. It has been suggested that the essay overrates the 
importance of knowing how to say a thing and underrates the importance 
of having something to say. In view of these limitations, the ordinary essay 
examination has little validity as an instrument of diagnosis. 

Low reliability. In the second place, the essay examination as commonly
used is low in reliability. Since short tests are usually less reliable
than long tests, the narrow sampling afforded by essay examinations would
tend to restrict their reliability. Still more serious is the subjectivity of
scoring. Numerous studies have shown that teachers cannot agree with 
each other as to the values to be allowed examination papers of the essay 
type. Studies have also shown that the same teachers cannot agree with 
themselves on a second series of values assigned independently to the same 
papers. Part of this is due to different standards of marking and different 
weighting of the questions. Certain other factors, such as the physical and 
mental condition of the person marking the papers, also tend to condition 
the mark assigned a paper by a given teacher at any particular time. An 
English poet states the situation as follows:⁵

’Twixt Right and Wrong the Difference is dim:
’Tis settled by the Moderator’s Whim:
Perchance the Delta on your Paper marked
Means that his Lunch has disagreed with him.


4 G. M. Ruch, The Objective or New-Type Examination, page 54. Chicago: Scott,
Foresman & Company, 1929.

5 Quoted by I. L. Kandel, Examinations and Their Substitutes in the United States,
page 28. New York: Carnegie Foundation for the Advancement of Teaching, 1936.

6 Robert R. Ashburn, “An Experiment in the Essay-Type Question,” Journal of
Experimental Education, 7: 1-3, September, 1938.

In a study⁶ made at the University of West Virginia, Ashburn came to
the conclusion that “the passing or failing of about 40 per cent depends,
not on what they know or do not know, but on who reads the papers" and 
that “the passing or failing of about 10 per cent depends . . . on when the 
papers are read.” It has been observed that the scores tend to rise as time 
passes, and that the values assigned tend to be greatly influenced by those 
allowed the paper immediately preceding. For example, one writer asserts 
that, “A C paper may be graded B if it is read after an illiterate theme,
but if it follows an A+ paper, if such can be found, it seems to be of D
caliber.”⁷
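
A figure such as Ashburn's 40 per cent comes from counting the papers whose pass-or-fail verdict changes with the reader. The sketch below shows one way such a tally could be made in Python; the marks, the passing mark of 70, and the function name are hypothetical and are not taken from Ashburn's study.

```python
def share_depending_on_reader(marks_by_reader, passing_mark):
    """Fraction of papers that pass under one reader and fail under another.

    marks_by_reader -- list of lists; each inner list holds one reader's
                       marks for the same papers, in the same order
    """
    papers = list(zip(*marks_by_reader))
    if not papers:
        raise ValueError("no papers to compare")
    split = sum(
        1 for marks in papers
        if any(m >= passing_mark for m in marks)
        and any(m < passing_mark for m in marks)
    )
    return split / len(papers)


if __name__ == "__main__":
    # Invented marks from two readers on the same ten papers.
    reader_a = [82, 68, 75, 59, 71, 66, 90, 73, 64, 77]
    reader_b = [78, 72, 69, 61, 74, 71, 88, 67, 62, 80]
    print(share_depending_on_reader([reader_a, reader_b], passing_mark=70))
```

In these invented data the two readers never differ by more than six points on a paper, yet four of the ten papers pass with one reader and fail with the other, because the marks cluster near the passing line.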

That this situation is not peculiar to American education is indicated by 
the Examination Inquiry conducted by the International Institute of 
Teachers College, Columbia University.⁸ In fact, one writer⁹ asserts that
evidence showed the unreliability of essay examinations in Europe was 
"even more serious" than had been revealed many times in America. 
In support of this rather surprising conclusion, he says: “In the English 
Studies, examiners were found to reverse their judgments almost completely 
when asked to mark the same papers they had scored a year before." 

Bowles'¹⁰ comments concerning England's examination system for college
entrance are illuminating. He concludes that it is quite deficient when
judged by American standards of reliability and statistical validity, but 
that because of various safeguards for the individual “the system works” 
well. 

In fairness to the essay examination, however, it should be pointed out
that many of the studies reported have been with unimproved forms of the
examination given under unfavorable conditions. Often the essay examination
at its worst has been compared with an improved objective test.
Under such conditions, the former is bound to show up in an unfavorable
light. If objective tests had been scored under similar conditions, without
scoring rules or keys, the agreement of the scores would be less impressive.
As a matter of fact, even with scoring rules and keys, the agreement among
the scores on objective tests allowed by amateur scorers is far from perfect.
Under favorable conditions the agreement among scorers of essay examinations
approximates that reported for objective tests. One study¹¹ reports
that the average correlation coefficient between first and second scorings
of an essay test in history by three experienced scorers was .98. Another
study¹² reports that the median coefficient obtained between two independent
readings of certain College Entrance Board examinations was .97. All
twenty of the coefficients were above .90, with the exception of English,
which was .84. It must be kept in mind, however, that these examinations
were so worded as to make the scoring more objective than is usually possible
with ordinary essay examinations.

7 John M. Stalnaker, “The Problem of the English Examination,” Educational Record,
17: 41, Supplement No. 10, October, 1936.

8 Published by the Bureau of Publications, 1936.

9 W. Carson Ryan, Jr., “The Seventh World Conference of the New Education
Fellowship,” School and Society, 44: 364, September 19, 1936.

10 Frank H. Bowles, The College Entrance Examination Board. 51st Annual Report of
the Director, 1951, pages 23-30. College Entrance Examination Board, 425 West 117th
Street, New York 27, New York, 1952.

11 Roy E. Cochran and Charles C. Weidemann, “Improvement of Consistency of
Scoring the ‘Explain’ and ‘Discuss’ Essay Examination,” a paper read before Section C
of the American Educational Research Association at Cleveland, Ohio, March 1, 1939.

It should be noted that most studies having to do with the reliability 
of essay examinations really show the reliability of marking the examinalion 
rather than the reliability of the examination itself. A few studies have been 
reported of the correlation between two forms of an essay examination 
designed for a particular purpose which were given to the same pupils and 
carefully marked by experienced examiners. McGregor and Ruch” used 
this procedure in studying eighth-grade examinations in sixteen subjects 
from 952 pupils in eleven states. Each paper in the two sets of examinations 
was marked independently by two experienced teachers. This study made 
it possible to compare the reliability of the examination with the reliability 
of marking the examination. The agreement of the two independent mark- 
ings of the same papers is represented by an average correlation of .62, 
while the agreement of the two sets of examinations marked by the same 
teacher is represented by an average correlation of only .43. One of Ruch’s 
students, Dr. W. E. Gordon,¹⁴ made a similar study of the New York
Regents’ Examinations with startlingly comparable results. He found the 
average agreement of the two independent markings of the same papers 
was .72, while the average agreement of the two sets of examinations 
marked by the same teacher was only .42. Another study¹⁵ conducted at
the University of Chicago High School showed that two independent sets 
of marks assigned by two “experienced readers of essay examinations” 
agreed to the extent of .944 on Form A and .845 on Form B, but that the 
correlation between Form A and Form B was only .60. These three studies 
seem to agree on one important point: The reliability of marking the essay 
examination is higher than the reliability of the examination itself. 
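
The distinction between the two kinds of reliability may be made concrete by a small computation. The Python sketch below is offered only as an illustration: the marks are invented for six hypothetical pupils, the names `pearson_r`, `form_a_reader_1`, `form_a_reader_2`, and `form_b_reader_1` are assumptions of this example, and none of the figures are taken from the studies cited. The first coefficient compares two readers' marks on the same papers (reliability of marking); the second compares one reader's marks on two forms of the examination (reliability of the examination itself).

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of marks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two marks")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("marks must vary for a correlation to be defined")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Invented marks for six pupils (hypothetical data, for illustration only).
form_a_reader_1 = [72, 85, 60, 90, 78, 66]   # Form A, first reader
form_a_reader_2 = [70, 88, 63, 87, 80, 64]   # same Form A papers, second reader
form_b_reader_1 = [80, 70, 68, 75, 88, 60]   # Form B, first reader

# Agreement of two independent markings of the same papers.
marking_reliability = pearson_r(form_a_reader_1, form_a_reader_2)

# Agreement of two forms of the examination marked by the same reader.
examination_reliability = pearson_r(form_a_reader_1, form_b_reader_1)

print(f"reliability of marking:     {marking_reliability:.2f}")
print(f"reliability of examination: {examination_reliability:.2f}")
```

With data of this kind the first coefficient is high and the second much lower, which is the pattern the three studies report.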

Low usability. The essay examination also ranks low in usability. There 
seems no escape from the fact that this type of examination is time-
consuming, both for the pupil and for the teacher. In fact, the additional
expenditure of time and energy over that needed for objective tests is so
serious a limitation that the use of essay examinations can be justified only
if it can be shown that the values realized are commensurate with this
investment.


12 John M. Stalnaker, “Essay Examinations Reliably Read,” School and Society, 
46: 671-672, November 20, 1937.

13 G. M. Ruch, The Objective or New-Type Examination, pages 91-96. Chicago: Scott,
Foresman & Company, 1929.

14 Ibid., pages 97-98.

15 Arthur E. Traxler and Harold A. Anderson, “The Reliability of an Essay Examination
in English,” School Review, 43: 534-539, September, 1935.


196 THE CONSTRUCTION OF TEACHER-MADE TESTS


B. Advantages of the Essay Examination 


Reliability and usability. Even the most enthusiastic advocate of 
essay examinations would scarcely claim their superiority over objective 
tests on the grounds of reliability or usability. The best that can be hoped 
for essay examinations is that by the use of improved techniques their 
reliability may approach that of objective tests. As regards usability, the
fact that the questions can be written on the blackboard is an advantage 
only in those schools which lack duplicating facilities. The reduction in 
time required to prepare essay examinations is more apparent than real, 
if the work is well done. Whatever advantage arises therefrom is more than 
offset by such considerations as the extra time demanded for giving and 
scoring. 

Validity. It is apparent that if the use of essay examinations can be 
justified it must be upon the ground of their superior validity for certain 
purposes. What, then, are the unique functions of these examinations? 

Unfortunately, upon this crucial issue little experimental evidence exists.
One study¹⁶ indicated that about 30 to 40 per cent of the mental functions
measured by improved essay tests of the “compare and contrast” type were
not measured by true-false tests covering the same material. Two similar
studies by Cochran and Weidemann¹⁷ compared one-word fact tests and
essay tests of the improved “explain” and “discuss” types covering the
same material, and concluded that about 40 per cent of the mental functions
measured by the latter were not measured by the former. The important
question of just what unique mental functions each type of test
measures remains to be answered.

In the absence of experimental evidence, it is necessary to fall back on
logical considerations. The essay test appears to be useful for measuring
four objectives of instruction: functional information, certain aspects of
thinking, study skills and work habits, and a functioning social philosophy.
It will be noted that these objectives emphasize the functioning, rather
than the mere possession, of knowledge.

There would appear to be little justification for using essay tests for the 
recall of knowledge in piecemeal fashion. Sims,¹⁸ however, analyzed 458
questions ordinarily classified as of the essay type, and found that fewer 
than half in the high school and fewer than one in five in the elementary 
school involved discussion, the others being almost equally divided be- 
tween simple-recall and short-answer questions requiring not more than 

16 C. C. Weidemann and Lyndall Fisher Newens, “Does the ‘Compare-and-Contrast’
Essay Test Measure the Same Mental Functions as the True-False Test?” Journal of
General Psychology, 9: 430-449, October, 1933.

17 Roy E. Cochran and Charles C. Weidemann, “ ‘Explain’ Essay vs. Word Answer
Fact Test,” Phi Delta Kappan, 17: 59-61, 75, December, 1934; and “A Study of Special
Types of Tests,” Phi Delta Kappan, 19: 113-115, 131, January, 1937.

18 Verner Martin Sims, “Essay Examination Questions Classified on the Basis of
Objectivity,” School and Society, 35: 100-102, January 16, 1932.




one sentence for a response. The Evaluation Committee of the Seattle 
Schools¹⁹ came to the conclusion that the evaluation of growth in language
ability would require the use of several types of tests. 


Both objective and essay tests appeared necessary to measure achievement at 
the various levels of knowledge. Objective tests of the multiple-choice and matching 
type could be used to measure achievement at the recognition and recall levels. 
However, evaluating achievement at the level of interpretation and evaluation would 
require essay-type tests, as well as certain kinds of objective tests; evaluating 
achievement at the level of application would seem to be done most effectively by 
essay tests, since this would involve measuring the student’s ability to utilize 
information learned in one situation in the solution of problems in a new setting. 


One other advantage of the essay examination should be mentioned.
Several experimental studies have shown that the type of measurement
used by the teacher influences the type of study procedures employed by
the pupils.²⁰ When pupils expect the test to be of the essay type, in whole
or in part, they seem more likely to employ such desirable study techniques
as making outlines and summaries, and seeking to perceive relationships
and trends, than is done when objective tests are used exclusively.

The practical conclusion follows that neither the essay nor the objective test
should be used exclusively. From Lee and Segel’s²¹ analysis of the measurement
practices of 1,600 secondary school teachers, distributed widely over
the United States, and from the Hensley-Davis study,²² it appears that
teachers favor the use of a combination of the two types. It is encouraging
that the practice of more and more teachers seems to be governed by the
sound philosophy of measurement stated by Lindquist in the following
sentence:²³

The intelligent point of view is that which recognizes that whatever advantages 
either type may have are specific advantages in specific situations; that while 
certain purposes may be best served by one type, other purposes are best served 
by the other; and, above all, that the adequacy of either type in any specific
situation is much more dependent upon the ingenuity and intelligence with which
the test is used than upon any inherent characteristic or limitation of the type
employed.


C. Suggestions for Improving Essay Examinations 


Although the essay examination has been in existence for hundreds of 
years, the amount of research devoted to it is much less than that devoted


19 Helen F. Olson, “Evaluating Growth in Language Ability,” Journal of Educational
Research, 39: 247, December, 1945.

20 For a fuller discussion of this point, see Chapter B

21 J. Murray Lee and David Segel, Testing Practices of High-School Teachers, page 38.
United States Office of Education Bulletin, No. 9, 1936.

22 Iven H. Hensley and Robert A. Davis, “What High-School Teachers Think and
Do About Their Examinations,” Educational Administration and Supervision, 38: 219-
228, April, 1952.

23 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, The Construction and Use of
Achievement Examinations, page 20. Boston: Houghton Mifflin Company, 1936.




to the objective test, which is comparatively new. Furthermore, practically 
all the research relating to the former has been of a negative kind. Its pur- 
pose has been to show how poor unimproved essay examinations are, rather 
than to devise means for their betterment. However, a study of the meager 
experimental literature does yield several positive suggestions. The next 
two sections will be devoted to a consideration of some of the most promis- 
ing of these suggestions. 

Improving the construction and use of essay examinations. It is 
just as important to know where to use the essay examination as it is to 
know how to use it. It is wise to restrict the use of the essay test to the 
measurement of those functions for which it is best adapted. There would 
usually appear to be no good reason for employing subjective measurement 
where objective measurement is adequate. What, then, does the essay 
examination attempt to do? 

Weidemann²⁴ recognizes eleven definable types of improved essay examinations.
Arranged in a series from simple to complex, these types are
as follows: (1) what, who, when, which, and where; (2) list; (3) outline;
(4) describe; (5) contrast; (6) compare; (7) explain; (8) discuss; (9) develop;
(10) summarize; and (11) evaluate. The first two types seem hardly distinguishable
from recall tests of the objective type. Many years ago Monroe
and Carter²⁵ made a very suggestive classification of thought questions
into twenty types. These types, together with an illustration of each, taken
from the field of measurement, appear below.


Thought Questions 


1. Selective recall—basis given. 
Name three important developments in measurement which occurred during 
the first decade of the twentieth century. 


2. Evaluating recall—basis given. 
Name the three persons who have had the greatest influence on the develop- 
ment of intelligence testing. 


3. Comparison of two things—on a single designated basis. 
Compare essay examinations and objective tests from the standpoint of 
their effect upon the study procedures used by the learner. 


4. Comparison of two things—in general. 
Compare standardized and non-standardized tests. 


5. Decision—for or against. 
In which, in your opinion, can you do better, oral or written examinations? 
Why? 


24 C. C. Weidemann, “Written Examination Procedures,” Phi Delta Kappan, 16:
78-83, October, 1933; also, C. C. Weidemann, “Review of Essay Test Studies,” Journal
of Higher Education, 12: 41-44, January, 1941.

25 Walter S. Monroe and Ralph E. Carter, The Use of Different Types of Thought
Questions in Secondary Schools and Their Relative Difficulty for Students, 26 pages.
Urbana, Illinois: Bureau of Educational Research Bulletin, Number 14, University of
Illinois, 1923.


6. Cause or effects.
How do you account for the great popularity of objective tests during the
last thirty-five years?


7. Explanation of the use or exact meaning of some phrase or statement in a
passage.
What is the meaning of “Delta” in the verse quoted on page 193?


8. Summary of some unit of the text or of some article read.
Summarize in not more than one page the advantages and limitations of
essay examinations.


9. Analysis (The word itself is seldom involved in the question).
Why are many so-called “progressive educators” suspicious of standardized
tests?


10. Statement of relationships.
Why is it that nearly all essay examinations, regardless of the school subject,
tend to be measures of the learner's mastery of English?


11. Illustrations or examples (your own) of principles in science, construction
in language, etc.
Give two original examples of specific determiners in objective tests.


12. Classification (usually the converse of No. 11).
What type of error appears in this test item? “With what Balkan country
did the Allies fight in World War I?”


13. Application of rules or principles to new situations.
In the light of China’s experience with state examinations what would you
expect to be the effect of the Regents’ Examinations in New York?


14. Discussion.
Discuss the place of measurement in science.


15. Statement of aim—author’s purpose in his selection or organization of
material.
In view of the author's discussion on pages 19 and 20, why are so many
authorities quoted in Chapter 1?


16. Criticism—as to the adequacy, correctness, or relevancy of a printed statement,
or a classmate’s answer to a question on the lesson.
Criticize or defend the statement: “The essay examination overrates the
importance of knowing how to say a thing and underrates the importance
of having something to say.”


17. Outline.
Outline the principal steps in the construction of an informal teacher-made
test.


18. Reorganization of facts (a good type of review question to give training in
organization).
Name ten practical suggestions from Chapters 4, 5, and 6 that are particularly
applicable to the subject you teach or plan to teach.


19. Formulation of new questions—problems and questions raised.
What are some problems relating to the use of essay examinations that
require further study?


20. New methods of procedure.
Suggest a plan for proving the truth or falsity of the contention that exemption
from examinations is a good policy in high school.




Special advantages. It will be noted that the classifications by Weidemann
and by Monroe and Carter recognize a considerable number of rather
distinct abilities, which are measurable by essay tests. It is probably best 
to measure each one separately rather than to attempt to measure several 
of them by the same test. It will be further noted that the emphasis in most 
of these types is upon organization, relationship, evaluation, application, 
or some similar ability to which a purely objective test may be poorly 
adapted. Teachers should study each type of essay question carefully until 
they are familiar with its distinguishing characteristics. If a proposed essay 
question does not seem to conform to one of these types, it had usually 
better be reworded or adapted to some form of the objective test. No ques- 
tion should be included until its purpose has been clearly defined. 

The essay examination would appear to be particularly valuable in two
situations. The first of these is obviously in such courses as English composition
and journalism, where the student's ability to express himself
effectively is the major objective of instruction. The second situation is in
advanced courses of other subjects, where critical evaluation and the ability
to assimilate and organize large amounts of material constitute important
objectives. In this connection it is significant to note that Jones²⁶
found that 68 per cent of the college students who took senior comprehensive
examinations and 55 per cent of the superior students in other colleges
stated their views as follows: “I think one's ability is far better shown
through discussion questions than through short objective questions.”

There is some evidence that a more valid sampling of the pupil's knowledge
is afforded by increasing the number of questions and reducing the
length of discussion expected on each. In many cases a well-constructed
paragraph is sufficient. Very few discussions need exceed one or two pages
in length. In any case, the question should be so worded as to restrict the
responses toward the objective which it is desired to measure. For example,
Wrightstone suggests that the question, “Explain the reasons for the strike
at General Motors in 1937,” is too general, and would be improved if it
were restricted by the addition of the phrases “to show (a) the labor grievances
of the employees; (b) the practices of the employer; (c) related national,
social, and economic factors; (d) the rival labor unions; and (e) the
method of striking.”²⁷ It must be recognized, however, that such suggestions,
at least in part, take from the essay examination its uniqueness.
The proposed modifications may appear to improve the reliability of the
traditional examination by the obvious device of making it more like the
objective test.


26 Edward Safford Jones, Comprehensive Examinations in American Colleges, page 373. 
New York: The Macmillan Company, 1933. 

27 J. W. Wrightstone, “Are Essay Examinations Obsolete?” Social Education, 1: 403,
September, 1937. 




One of the difficulties with constructing essay tests is that the process 
appears so easy. As a matter of fact, it is probably more difficult to construct
essay tests of high quality than it is to construct objective tests of high quality. 
Much care and thought must be given to their construction, if tests of any 
kind are to measure anything but mere memory for factual knowledge. 
Many of the general principles of testing outlined in an earlier chapter are 
as applicable to essay tests as to objective tests. There is always risk that, 
in attempting to phrase essay questions so that they can be scored more 
objectively, the result may be made less satisfactory than an out-and-out 
objective test. Critical revision, utilizing, if possible, the judgment of a 
colleague, is especially important.

Preparing students to take essay tests. Some writers have emphasized
the importance of training pupils in taking examinations. Worcester²⁸
suggests that the essay examination is “obviously invalid and unfair,”
at least in part, because the pupils are being required to take a test on a
type of work for which they have had no specific training. The rational
solution offered is to supply the necessary training rather than to abandon
the essay examination. Wider experience and training in preparing for and
in taking tests of all types is likely to increase the accuracy of measurement.
Edmiston²⁹ prepared instructions to pupils for taking examinations which
were far more elaborate than the usual directions accompanying tests.
He found that the use of these instructions increased the validity of the
examinations and produced “definite improvements in students’ records
of achievement from examinations.” It would appear wise to provide instruction
of this sort in the regular program of studies. It is most unfortunate
when a pupil fails to receive recognition for knowledge he actually
possesses simply because he has not mastered the technique of putting it
on paper. Edmiston’s suggestions,³⁰ given below, will prove helpful in planning
such a program of instruction.

IMPORTANT CONSIDERATIONS IN TAKING EXAMINATIONS
1. Your name should appear on the first or last sheet of the examination, if
sheets are securely bound. Each loose sheet should have the name entered
inconspicuously, preferably on back where it will not be seen by the scorer,
when scoring.

2. Write legibly. Your answer can't be right if it can’t be read. . . . Be sure
your pen or pencil (if allowed) fosters distinct and not blurred writing.

3. Use terms or a vocabulary suited to the subject. Do not use a word unless
its meaning is clear to you, and repeat a word rather than use another which
may not have exactly the same desired meaning.


28 D. A. Worcester, “On the Validity of Testing,” School Review, 42: 527-531,
September, 1934.

29 R. W. Edmiston, “Examine the Examination,” Journal of Educational Psychology,
30: 126-138, February, 1939.

30 Ibid., pages 137-138.




4. Space (the back of sheets, the margins, or an extra sheet) should be used for
a. computations.
b. practice in the formation of desirable statements, not padded but furnishing
quality rather than quantity to the answer.
c. the hasty jotting of facts pertaining to some questions when these facts
arise, while working upon another question.

5. The statement of each question must be fully considered. Carelessness not
only penalizes the student but also lowers the dependability of the measurement
obtained by the instructor.

6. The directions telling how to answer the questions should be carefully
followed. Underscore the important points in the directions.

7. In essay questions, underscore the part of the statement that furnishes the 
direct question asked. Then underscore any parts of the statement which 
furnish data for the answer. Number each part so that you will not omit 
anything from your answer. 

8. Proceed directly through the examination with no lengthy consideration of 
unfamiliar points. After completing the parts which were readily answered, 
start again and answer those questions which yield to more diligent effort. 
Do not waste time by trial and error method upon questions which bring 
no recognition or recall of related materials. After completing the second 
consideration of the test, spend the remainder of the time upon the more 
familiar of the unanswered questions. Note that hesitation wastes time, 
ruins confidence, and destroys mind-set. 

9. If after thorough consideration you do not understand some direction or 
question due to other than lack of knowledge of the course, call the attention 
of the person in charge with as little disturbance as possible in order that 
the tester may come to your seat or allow you to come to him as conditions 
may determine. 

10. Reread each answer before passing to the next question and the completed 
examination before delivery to the instructor. Is the meaning clear and 
writing legible? 

By way of summary, three important suggestions for the construction
and use of essay examinations are as follows:


1. Restrict the use of the essay examination to those functions to which 
it is best adapted. When it is not clear that the essay type is required for 
measuring the desired objective, use the objective test. 

2. Increase the number of questions asked and reduce the amount of
discussion required on each. Always indicate clearly the type of discussion 
desired. 

3. Make definite provisions for teaching pupils how to take examina- 
tions. Specific training in preparing for and in taking tests and examina- 
tions of the various types commonly encountered is a legitimate objective 
of instruction. 

Improving the scoring or grading of essay examinations. Rinsland³¹
makes a distinction between the terms scoring and grading. Scoring


31 Henry Daniel Rinsland, Constructing Tests and Grading in Elementary and High
School Subjects, page 302. New York: Prentice-Hall, Inc., 1938. 




is an objective process of counting right or wrong responses, whereas grad- 
ing always means interpreting quality in terms of some criterion. Strictly 
speaking, then, it is more correct to speak of grading or rating essay ex- 
aminations than it is to speak of scoring them. 

It is, of course, apparent that whatever claims are made for the validity 
of the essay test as a measuring instrument are conditioned upon the as- 
sumption that the papers can be read accurately. Not only must the essay 
test, for example, call forth from superior pupils responses which are con- 
sistently superior, but the teachers marking the papers must be able 
consistently to recognize that they are superior responses. The same is true
of responses with other degrees of merit. The grading of the essay examination,
therefore, occupies a strategic position.

To begin with, certain preventive measures are important. A careful 
wording of the questions and directions to the pupil which indicate clearly
just what type of response is expected will simplify the problem of marking
the papers. The use of optional questions should be discouraged. The simple 
precaution of having the pupil record his name inconspicuously either on 
the back or at the end of the paper, rather than at the top of each page, 
is likely to increase the accuracy with which the paper is graded. 

Cochran and Weidemann³³ outline a procedure for evaluating essay examinations,
the essentials of which can be taught in ten minutes. This is
shown by the fact that the majority of the consistency coefficients of two 
series of scorings made five weeks apart were between .80 and .90 for 
teachers with ten minutes of training. Independent scores by experienced 
readers showed an average agreement of .98 when the procedure given 
below in a slightly modified and abridged form was used. 


SUGGESTIONS FOR MARKING ESSAY EXAMINATIONS
(After Cochran and Weidemann) 


1. I read over a sampling of the papers to obtain a general idea of the grade of 
answer I may expect. 

2. I score one question through all of the papers before I consider another question. 
I have found two outstanding advantages in scoring one question through an 
entire set of papers. The first is that the comparison of answers appears to 
make the grades more exact and just. The second is that having to keep only 
one list of points in mind saves time and promotes accuracy.

3. Before scoring any papers I read the material in the text which covers the 
questions, and also the lecture notes on the subject. 

4. I make a list of the main points which should be discussed in every answer.
Each of these points must be weighed and assigned a certain value if the scoring 
is to approach accuracy. This value assigned to the main points needed for a 
reasonably adequate answer is designated as the minimum score. If a pupil 
elaborates and discusses points not required yet pertinent to the question, his 
answer is given an additional value, called the extra score. This extra score may 
vary for different pupils, but may not exceed a certain set maximum. 


32 John M. Stalnaker, “The Essay Type of Examination,” op. cit., pages 505-506.
33 Roy E. Cochran and C. C. Weidemann, op. cit.




5. After the points have been weighed, the actual scoring begins. I read the answer 
through once and then check back over it for fact details. I attempt to mark 
every historical mistake on the paper and write in briefly the correction. As I 
read the answer I make a mental note of the points omitted and the value of 
each point, so that when the end of the question is reached, I have the minimum 
grade figured. If there is any additional or extra percentage to be given, it is
added to the minimum score, and then the value of the question is written
in terms of the per cent deducted rather than the positive per cent. Then when
every question on a paper is scored, it is a simple matter to add the negative
quantities and obtain the final grade. 
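
The arithmetic of steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' own procedure: the function name `score_answer`, the point names, and all values are hypothetical, and the way the deduction is figured (full value of the required points less the credit earned, never below zero) is one reading of the description above.

```python
def score_answer(point_values, points_covered, extra_earned, max_extra):
    """Return (minimum_score, extra_score, total, deducted) for one answer.

    point_values   -- dict mapping each required main point to its value
    points_covered -- the required points the pupil actually discussed
    extra_earned   -- credit the reader allows for pertinent extra discussion
    max_extra      -- the set maximum which the extra score may not exceed
    """
    covered = set(points_covered)
    unknown = covered - set(point_values)
    if unknown:
        raise ValueError(f"points not on the prepared list: {sorted(unknown)}")
    if extra_earned < 0 or max_extra < 0:
        raise ValueError("extra credit values must not be negative")

    # Minimum score: value of the required points that were discussed.
    minimum_score = sum(v for point, v in point_values.items() if point in covered)
    # Extra score: may vary from pupil to pupil but is capped at the maximum.
    extra_score = min(extra_earned, max_extra)
    total = minimum_score + extra_score
    # Assumed reading of "per cent deducted": shortfall from the full value.
    deducted = max(0, sum(point_values.values()) - total)
    return minimum_score, extra_score, total, deducted


# Hypothetical question worth 10 points, with three required main points.
values = {"cause": 4, "course": 3, "result": 3}
result = score_answer(values, {"cause", "result"}, extra_earned=5, max_extra=2)
print(result)
```

Preparing the table of point values before any paper is read is what allows one question to be scored through the whole set with a single list of points in mind.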


It is difficult to overemphasize the importance of three things: (1) the 
preparing in advance of a list of answers which are considered adequate 
for the objectives of the test; (2) the assigning of a specific value to each 
essential part of the answers; and (3) the grading of one question through 
all the papers before going on to another question. Most students of the 
problem recommend attempting to distinguish a relatively small number
of degrees of merit in an answer. Perhaps as good a plan as any is to allow 
credit for each part of the answer considered essential to a question as fol- 
lows: 3 for superior, 2 for average, 1 for inferior, and 0 for an omission or 
wrong reply. Stalnaker³⁴ found that the weighting of essay questions was
of negligible value—the correlation between weighted and unweighted 
scores on the College Entrance Board Examinations varying from .97 
to .997. 
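
The plan of allowing 3, 2, 1, or 0 for each essential part of an answer may be sketched as follows. The Python below is an illustration only; the names `CREDIT`, `credit_for_answer`, and `paper_total`, and the sample ratings and weights, are assumptions of this example. The optional weights are included merely to show what weighting the questions would involve; the finding just cited suggests that it seldom changes the standing of the papers.

```python
# Credit allowed for each essential part of an answer, as described above.
CREDIT = {"superior": 3, "average": 2, "inferior": 1, "omitted": 0, "wrong": 0}


def credit_for_answer(part_ratings):
    """Sum the credit for the essential parts of one answer."""
    return sum(CREDIT[rating] for rating in part_ratings)


def paper_total(question_credits, weights=None):
    """Total for a paper; with no weights every question counts equally."""
    if weights is None:
        weights = [1] * len(question_credits)
    if len(weights) != len(question_credits):
        raise ValueError("one weight is needed for each question")
    return sum(credit * weight for credit, weight in zip(question_credits, weights))


# Hypothetical paper with two questions, each judged on three essential parts.
question_1 = credit_for_answer(["superior", "average", "omitted"])
question_2 = credit_for_answer(["average", "inferior", "wrong"])

unweighted = paper_total([question_1, question_2])
weighted = paper_total([question_1, question_2], weights=[2, 1])
print(unweighted, weighted)
```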

Grading by sorting. In addition to the points made by Cochran and 
Weidemann, several authorities have found another suggestion helpful. 
The suggestion is to make a sorting of the papers into three to five piles, 
according to the merit of the discussion of each question on the basis of a 
brief preliminary examination of the answers. Sims describes one such procedure
as follows:³⁵


1. Quickly read through the papers and on the basis of your opinion of their 
worth sort them into five groups as follows: (a) very superior papers, (b) superior
papers, (c) average papers, (d) inferior papers, (e) very inferior papers.

2. Reread the papers in each group and shift any that you feel have been misplaced. 


Flanagan³⁶ has shown that the optimal percentages for five groups are
9, 20, 42, 20, and 9. Therefore, about 10 per cent of the papers might be 
called "very superior" and 10 per cent "very inferior." Twenty per cent 


34 John M. Stalnaker, “Weighting Questions in the Essay-Type Examination,” Journal
of Educational Psychology, 29: 481-490, October, 1938.

35 Verner Martin Sims, “The Objectivity, Reliability, and Validity of an Essay
Examination Graded by Rating,” Journal of Educational Research, 24: 216-223,
October, 1931.

36 John C. Flanagan, “The Effectiveness of Short Methods for Calculating Correlation
Coefficients,” Psychological Bulletin, 49: 342-348, July, 1952.




would be "superior" and a like percentage would be "inferior." The remaining
40 per cent are "average." These are rough approximations, of
course, dependent upon the ability level of the particular student group
being graded.
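
For a class of any given size, the number of papers falling in each pile under the percentages 9, 20, 42, 20, and 9 can be worked out as in the Python sketch below. It is an illustration only: the names `sort_into_piles`, `FIVE_GROUP_PERCENTAGES`, and `GROUP_LABELS` are assumptions of this example, the papers are assumed to have been arranged already from best to poorest on a preliminary reading, and rounding the cumulative boundaries is merely one convenient way of handling classes that do not divide evenly.

```python
FIVE_GROUP_PERCENTAGES = (9, 20, 42, 20, 9)
GROUP_LABELS = ("very superior", "superior", "average", "inferior", "very inferior")


def sort_into_piles(papers_best_first, percentages=FIVE_GROUP_PERCENTAGES):
    """Cut a best-to-poorest list of papers into piles of the given percentages."""
    if sum(percentages) != 100:
        raise ValueError("percentages must total 100")
    n = len(papers_best_first)
    piles = []
    start = 0
    cumulative = 0
    for pct in percentages:
        cumulative += pct
        # Round the cumulative boundary so that every paper lands in one pile.
        end = round(n * cumulative / 100)
        piles.append(list(papers_best_first[start:end]))
        start = end
    return piles


# Hypothetical set of 100 papers, numbered in order of preliminary merit.
piles = sort_into_piles(list(range(1, 101)))
pile_sizes = [len(pile) for pile in piles]
for label, size in zip(GROUP_LABELS, pile_sizes):
    print(f"{label}: {size} papers")
```

As the text notes, such proportions are only a starting point; the reader should still shift any paper that appears misplaced on rereading.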

The preliminary sorting of the papers into piles of approximately equal 
merit before assigning numerical values to them will help to avoid the 
difficulty pointed out by Stalnaker: namely, that the values allowed a 
paper are often greatly influenced by the merit of the paper which happens 
immediately to precede it in the order of scoring. It is also easier to locate 
papers distinctly out of line with those in a particular group supposedly
of similar quality. It is a good idea to throw the papers into a single group 
after each question has been evaluated and before they are re-sorted into 
piles according to the merits of the discussions of the next question. This 
procedure will make it easier to conceal the identity of the particular pupil 
whose paper is being judged and so to avoid one of the most disturbing 
factors in marking essay examinations. 

The school should adopt a policy regarding what factors shall be con- 
sidered, and what factors shall not be considered, in evaluating a written 
examination. Only those factors should be taken into account which afford 
evidence of the degree to which the pupil has attained the objectives set up for 
that particular course. Except in English classes, this will rule out making 
arbitrary reductions for such things as faulty sentence structure, para- 
graphing, handwriting, and the spelling of nontechnical words. These fac- 
tors will be considered only in so far as they affect the clarity of the 
pupil’s discussion. It is always legitimate to hold the pupil responsible 
for the spelling, as well as the meaning, of the vocabulary which is peculiar 
to the course. 

This does not mean that the quality of the written English used in ex- 
aminations is unimportant and should therefore be disregarded. On the 
contrary, it is always very important. But it should be considered only in 
relation to that for which it may be accepted as valid evidence: namely, 
in determination of the pupil’s mark in English. Where the teacher has 
complete charge of an entire grade, this adjustment is easy to make. But 
where the school is departmentalized the problem is more difficult. Even 
here it should be possible to work out a system whereby at intervals the 
papers in other subjects, after having been graded as to content, may be 
turned over to the English teacher to be judged from the viewpoint of their 
merits as English compositions. In this way it may be possible to sample 
the pupil’s characteristic performance in written English better than when 
he writes a paper specifically for the English teacher. And, what is equally 
important, it makes the pupil’s mark in other subjects a measure of achieve- 
ment in those subjects, rather than partly a measure of skill in English 
composition. 






PART III 


The Testing Program 


Steps in the Testing Program 


Dear Professor: 

I have decided to give some tests in my school this fall. Please suggest a few 
good ones I might try. Also let me know where to get them and what they will 
cost. 


Dear Professor: 

We gave the Up-to-Date General Achievement Tests at the beginning of the 
school year. As we now have most of them scored, please advise me how to use 
the results so as to get the most good out of them. Any help will be greatly 
appreciated. 


Probably every college professor who offers courses in measurement has 
received letters like those above. They indicate that some school is under- 
taking, or has already undertaken, to use standard tests without under- 
standing what it is all about. Always, testing should have a program to 
guide it.¹ What, then, is a “testing program”? 

General considerations. The word “program” has certain important 
implications, such as order, system, planning. It implies a sequence of events 
that has been determined upon after careful thought, rather than some 
haphazard, hit-or-miss affair. One of the chief weaknesses of many at- 
tempts to use standard tests is that there has been no program worthy 
of the name. The whole procedure has simply led a precarious hand-to- 
mouth existence from beginning to end. 

Spence² has suggested that “a good testing program should be supplementary 
not duplicative, usable not confusing, economical not burdensome, 
comprehensive not sporadic, suggestive not dogmatic, progressive 
not static.” Such a program, at least in tentative form, may very well cover 
an extended period, rather than be adopted piecemeal year by year. One 
advantage of this long-range planning is that it makes possible a varied 
program without leaving gaps or involving needless duplication. Stenquist,³ 
speaking from wide experience, strongly advocates “some sort of systematically 
recurring schedules as opposed to sporadic testing,” since schedules 
make possible “enormously greater gains” from testing. Spence offers for 
elementary schools what he calls “a conservative approach to the problem.” 
This program is given in Table 26. 


1 Julian C. Stanley, “Standardized Tests and Educational Objectives,” Peabody 
Journal of Education, 28: 218-221, January, 1951. 
2 Ralph B. Spence, “A Comprehensive Testing Program for Elementary Schools,” 
Teachers College Record, 34: 279-284, January, 1933. 


TABLE 26 


PLAN FOR A TESTING PROGRAM FOR THE ELEMENTARY SCHOOL (AFTER SPENCE) 


GRADE     INTELLIGENCEᵃ      ACHIEVEMENT BATTERY,       ACHIEVEMENT TESTS FOR SPECIAL 
                             ANNUAL (GIVEN IN           EMPHASIS, ALL GRADES FROM 3 OR 
                             MAR. OR APR.)ᵇ             4 TO 8, ROTATING (GIVEN IN 
                                                        OCTOBER)ᵇ 

Kgn.-I    Two Group Tests 
II                           Reading Battery 
III                          Skill Subjects Battery     First Year—Reading 
IV        One Group Test     Complete Battery           Second Year—Arithmetic 
V                                                       Third Year—Social Studies 
VI                           Complete Battery           Fourth Year—Language 
VII                                                       Usage and Spelling 
VIII                         Complete Battery           Fifth Year—Reading, etc. 


a Retests for special cases as needed, preferably with an individual test. 
b All dates based on groups beginning a grade in September. Teachers use diagnostic 
tests throughout the year. 


It will be noted that this program calls for the use of both intelligence 
tests and achievement tests, and for the use of test batteries as well as of 
tests in the separate subjects. It is also expected that the program will 
merely supplement rather than supplant the ordinary informal tests and 
examinations made by the classroom teacher. A slight modification of the 
schedule as presented would involve giving a general test battery in all 
subjects about every third year, and an intensive program limited to one 
subject in each of the intervening years. The cost of such a program of 
standard tests would be less than twenty-five cents per pupil per year. 
If the tests are intelligently used, it is doubtful whether greater returns 
can be had by the school from the same amount of money spent in any 
other way. 
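
The modified schedule just described can be laid out mechanically. The following is only a minimal sketch, not part of Spence's plan: the function name, the starting year, and the three-year battery interval are assumptions chosen for illustration, and the subjects are taken in the rotating order shown in Table 26.

```python
from itertools import cycle


def plan_schedule(start_year, n_years, subjects, battery_every=3):
    """Alternate a general battery with single-subject intensive years.

    Hypothetical helper: a battery falls on every `battery_every`-th year,
    and the intervening years rotate through `subjects` in order.
    """
    rotation = cycle(subjects)  # subjects repeat once the list is exhausted
    plan = []
    for offset in range(n_years):
        year = start_year + offset
        if offset % battery_every == 0:
            activity = "general battery, all subjects"
        else:
            activity = "intensive program: " + next(rotation)
        plan.append((year, activity))
    return plan


if __name__ == "__main__":
    subjects = ["reading", "arithmetic", "social studies",
                "language usage and spelling"]
    for year, activity in plan_schedule(1954, 6, subjects):
        print(year, activity)
```

Laying the plan out for several years at once makes it easy to see whether any subject is left untested for too long, which is the point of long-range planning.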


3 John L. Stenquist, “Recent Developments in the Uses of Tests,” Review of 
Educational Research, 3: 60, February, 1933. 




Traxler,⁴ in giving a practical discussion of the planning and administration 
of the testing program, divides tests into two broad categories. The 
first includes group tests of intelligence and achievement tests in the major 
subject-matter areas. These should be administered at regular intervals to 
every normal pupil in the school. The second category includes individual 
intelligence tests, special aptitude tests, personality tests, and tests of voca- 
tional interest. 

The following comprehensive “Platform for the Use of Standard Tests” 
has been prepared by a committee of Massachusetts teachers:⁵ 


1. Scientific measuring instruments and the scientific method are badly needed in 
present educational practice. No business of the financial magnitude of 
education spends so little time and money for objective and scientific fact finding. 

2. Standardized tests and measurements can fulfill their function of giving 
direction and efficiency to education only when used intelligently by teachers and 
administrators who have kept abreast of current knowledge on the subject 
and who are willing to follow the authors' directions for the administration 
and scoring of the tests used. The results of tests in which directions are not 
followed are worse than useless; they are misleading. 

3. Every standardized test administered should be given for a specific purpose, 
and having been given, its results should be used. Tests which are administered, 
scored, and piled in a cupboard serve no useful purpose. 

4. Standardized tests can be used most efficiently when their use is planned over 
a long period of time. 

5. Standardized tests have furnished valuable information to the school 
administrator in practically every instance in which they have been used. The 
possibilities of diagnostic tests in improving instruction through analysis and 
diagnosis of individual and class weaknesses have not nearly been realized. 
Tests are of the greatest value when their results cause a teacher to redefine 
his objectives, alter his methods, and redirect his emphasis as a result of new, 
increased, and more exact knowledge about his pupils. 

6. If standardized test results are to be used in measuring the efficiency of 
instruction, the conditions of scientific experimentation must prevail with all 
contributing factors defined, measured, and controlled. Failure to observe these 
conditions often results in teaching for test results alone, which not only 
invalidates any results which may be obtained, but also neglects some of the 
most desirable outcomes of good teaching which cannot be measured by tests. 
On the other hand, standardized test results cannot be ignored. They can be 
of great help to an administrator in judging a teacher’s work, but they cannot 
be used as a substitute for classroom visiting, supervision, and critical 
subjective analysis. 

7. No important decision regarding the placement of an individual pupil should 
be made on the basis of the result of one test of any kind. Educational 
achievement, mental age, I.Q., chronological age, health, teacher's judgment, physical 
development, social age, and emotional maturity are all factors to be 
considered in individual placement or any plan for grouping. 


4 Arthur E. Traxler, “Planning and Administering a Testing Program,” School Review, 
48: 253-267, April, 1940. 
5 “The Use of Standardized Tests in Massachusetts,” Test Service Bulletin No. 38, 
published by World Book Company, 1938. 




8. The content or items of a standardized test should never be used as material 
for class presentation and drill either before or after the administration of the 
test. To reproduce any part of the test, either on paper or the blackboard, 
is not only a violation of the publisher’s copyright, but will invalidate that test 
for future use in the school. For this reason, all copies of standardized tests 
should be accounted for, and extra copies should not ordinarily be left in the 
hands of the classroom teacher. 

9. The I.Q. or mental age obtained from one group test of intelligence is less 
reliable than an average of I.Q.’s or mental ages obtained from the results of 
two or more group tests of intelligence. An individual test of intelligence is 
more valid and reliable than group tests only when it is administered by a 
skillful and well-trained psychometrist. 

10. The use of standardized tests and a knowledge of the methods used in their 
construction should result in an improvement in teacher-made measures of 
achievement. 
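
The claim in point 9, that an average of two group tests is more reliable than one, can be given rough numerical form. The sketch below is an illustration only and is not part of the Massachusetts platform: it assumes the Spearman-Brown formula applies, that is, that the tests averaged are parallel and equally reliable, and the function name and the reliability of .80 are hypothetical.

```python
def spearman_brown(r_single, k=2):
    """Estimated reliability of the mean of k parallel tests.

    r_single is the reliability of one test; the formula assumes the k
    tests are parallel (equal reliability, equal intercorrelations).
    """
    if not 0.0 <= r_single <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1.0 + (k - 1) * r_single)


if __name__ == "__main__":
    # Hypothetical single-test reliability of .80, averaged over two tests.
    print(round(spearman_brown(0.80, k=2), 3))
```

Under these assumptions the gain from a second test is real but modest, and it shrinks as the reliability of the single test rises.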


One must not assume that the testing program should be restricted to 
the use of standard tests. As has been explained in the three preceding 
chapters, informal or teacher-made tests will have a large place in any complete 
testing program. Schools should have a carefully thought out general policy 
on such matters as the frequency of testing, the importance of final exami- 
nations, the factors to be considered in determining final marks, and, most 
important of all, the uses to be made of the results. 

Regardless of its scope, the complete testing program at any particular 
time will ordinarily consist of the following eight steps, or stages, in chron- 
ological order: 


1. Determining the purpose of the program. 
2. Selecting the appropriate test or tests. 
3. Administering the tests. 
4. Scoring the tests. 
5. Analyzing and interpreting the scores. 
6. Applying the results. 
7. Retesting to determine the success of the program. 
8. Making suitable records and reports. 


A. Determining the Purpose of the Program 


It must be recognized at all times that tests are only tools, and that 
measurement is always a means to an end, never an end itself. In the final 
analysis, then, the value of any testing program depends upon the use made 
of results. Unless something is going to be done about it in the end, there 
is no point to beginning. Merely "giving tests" without rhyme, rule, or 
reason is money, time, and effort wasted. The author once heard an ex- 
perienced educator say that he had wondered for years what many people 
did with standard tests after they had been "given." At last he found out: 
They filed them! The testing program should have a more serious purpose 
than that. The first step, therefore, in planning a program is to determine 
its purpose. In so doing, three things should be kept in mind: 


1. It should be co-operative. 
2. It should be practical. 
3. It should be definite. 


A co-operative program. As a rule, the program should not represent 
the judgment of any one person alone, but that of a group. It should be a 
truly co-operative enterprise. The teachers and administrative officers alike 
should be made to feel that it is “our” program, as, indeed, it should be. 
This is not likely to be the case, however, if the principal, superintendent, 
or research department determines the program and then “hands it down” 
to the classroom teachers. The entire staff should have a voice in deter- 
mining the purpose of the program and in formulating the plans, and all 
should have the opportunity of participating in it in every way possible 
from beginning to end. If this is not done, the teachers are not likely to un- 
derstand the program fully or to appreciate what it is attempting to do. 
Without the hearty co-operation of the entire staff, from the superintendent 
to the youngest teacher, the program is almost sure to fall short of its 
highest possibilities. It is suggested, therefore, that in the small school or 
school system the purpose of the program be decided upon after discussion 
in a general teachers’ meeting or series of meetings in which everyone has 
a chance to participate. In the larger school systems it is better to entrust 
a committee representing all interested groups with the responsibility of 
planning the program. Even then it should be brought before the entire 
staff before final action is taken. It cannot be emphasized too strongly 
that the success of the program largely depends upon co-operative action. 
An important part of the program, therefore, is the educating of the staff 
so that they can participate intelligently in it. Boyer emphasizes the fact 
that the teacher’s attitude is the most important factor to be considered 
in any plan, for “what she thinks and what she does as a result of her 
thinking, determines the success or failure of the plan.”⁶ 

A practical program. The general purpose of the testing program is 
to provide data which will help in the solution of some practical school 
problem. As a rule, this means that the problem whose solution is sought 
will have to do with administration, instruction, or research, or with some 
combination of these three. Even when tests are used primarily for 
administrative purposes, such as classification, they can also be used by the 
classroom teachers for diagnostic purposes. Unless the school has had 
considerable experience with testing, it will be better not to undertake a 
program primarily for research, although under favorable conditions research 
is a legitimate interest both of classroom teachers and of administrators. 
Even when the program is undertaken for research purposes, it should 
ordinarily be one which bears directly upon some practical issue in the 
school, such as determining the relative efficiency of different teaching 
methods or of administrative organizations. 


6 Thirty-Fifth Yearbook of the National Society for the Study of Education, Part I, 
page 213. Bloomington, Illinois: Public School Publishing Company, 1936. 

A definite program. It is not enough that the program be co-operative 
and practical. It must also be definite. The scope of the program may vary 
all the way from a single subject in one grade to a complete measurement 
of the entire school system. A common mistake of a staff inexperienced in 
the use of tests is to undertake too much. The danger then is that the pro- 
gram will drag along until everybody is more or less “fed-up” with it. 
Much of the value of the information sought from the tests will be lost 
unless the information is made available without delay. It is usually best, 
particularly with inexperienced teachers, to run the risk of undertaking too 
small a program rather than one too large. 

Another mistake is in stating the purpose of the program in too general 
terms. “To improve instruction” is too vague and inclusive. “To motivate 
study” or “to diagnose weaknesses and provide a basis for remedial 
instruction” would be better. Best of all would be a still more definite 
formulation, such as “to motivate study in fifth-grade arithmetic” or “to make 
a diagnosis of characteristic weaknesses in first-year algebra and to 
formulate a program of remedial teaching to strengthen them.” The purpose 
should state specifically both the nature and the scope of the program to be 
undertaken. Later chapters will discuss in some detail important 
administrative and instructional problems which tests may help to solve. In a 
long-time program the purpose for each year will have a definite relation- 
ship to the whole. No matter how stated, however, there is really one 
fundamental purpose in all measurement: namely, the better understanding 
of the individual pupil. To accomplish this purpose the information must 
be as definite and as complete as possible. 


B. Selecting the Appropriate Test or Tests 


When the purpose of the testing program has been determined, and not 
until then, the selection of the test, or tests, is in order. In Chapter 4 
attention was called to the fact that a test may be superior for one purpose 
and worthless for another. Great care must therefore be exercised in order 
to secure the tests most appropriate for the purpose. Three questions re- 
quire consideration: 


1. Who shall select the test or tests? 
2. What type of tests shall be used? 
3. What is the best procedure in making the selection? 


Who shall select the tests? The best qualified person, or persons, 
available should make the selection. In larger school systems the director 
of research is usually that person. But, even then, in the selection of 
achievement tests for specific subjects, the teachers of these subjects should 
be consulted, as their knowledge is essential in judging the curricular 
validity of the tests. In smaller schools the major responsibility is usually 
entrusted to the principal or superintendent.⁷ However, in the selection of 
achievement tests a committee of teachers will be helpful in judging the 
content of the tests. It is a sound principle in all evaluation that involves 
a subjective element to rely, whenever possible, upon the combined 
judgment of a group of competent persons rather than upon the judgment of 
any one individual. 

What type of tests shall be used? Ordinarily an adequate testing 
program will involve the use of more than one type of test. It will be 
desirable, except in a few cases such as in the beginning of the kindergarten or 
first grade, to use both intelligence and achievement tests. If considerations 
of time and money make it advisable to limit the testing program to one 
standard test for determining the present status of the class or school, the 
best choice will usually be a test battery.⁸ 

For a general survey of the intellectual status of the class or school, 
a good group test of intelligence will suffice, although as a rule an average 
of two is better than one alone. In any measurement of intelligence involv- 
ing group tests, especially if only one test is used, it is desirable to have 
retested with an individual intelligence test, such as the Revised Stanford- 
Binet or the Wechsler-Bellevue, the following pupils: those who test very 
low, say below an IQ of 80; those who test very high, say above an IQ 
of 130; or those whose scores are considerably out of line with the judgment 
of the teacher. The Revised Stanford-Binet is particularly trustworthy at 
the low IQ levels. The distinctive advantage of the individual intelligence 
test is the opportunity afforded for the examiner to observe the behavior 
of the child under standardized conditions. As a diagnostic instrument such 
a test is likely to be much superior to the group test. Pupils who have 
language difficulty should be tested individually, perhaps with a perform- 
ance test. 
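
The retesting rule in this paragraph amounts to a simple screening of group-test results. A minimal sketch follows, with some assumptions stated plainly: the cutoffs of 80 and 130 come from the text, but the text gives no figure for "considerably out of line," so the 15-point gap used here, like the function and parameter names, is hypothetical.

```python
def needs_individual_retest(group_iq, teacher_estimate=None,
                            low=80, high=130, max_gap=15):
    """Flag a pupil for an individual intelligence test.

    low and high follow the cutoffs suggested in the text. max_gap is an
    assumed threshold for a score that is considerably out of line with
    the teacher's judgment; it is not taken from the text.
    """
    if group_iq < low or group_iq > high:
        return True
    if teacher_estimate is not None:
        return abs(group_iq - teacher_estimate) >= max_gap
    return False


if __name__ == "__main__":
    # Hypothetical pupils: (name, group-test IQ, teacher's estimate or None)
    pupils = [("A", 76, None), ("B", 104, 100), ("C", 112, 90), ("D", 134, 130)]
    flagged = [name for name, iq, est in pupils
               if needs_individual_retest(iq, est)]
    print(flagged)
```

Whatever gap is adopted should be settled in advance by the staff, so that the retest list does not depend on who happens to read the score sheet.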

A reasonably complete testing program will require, as a rule, the use of 
intelligence tests along with achievement tests. Because of the relative 
constancy of the IQ it is unnecessary to administer intelligence tests each 
year. The mental level of most pupils can be predicted closely enough from 
intelligence tests periodically scheduled to permit ordinary comparisons 
with achievement. Page 225 outlines intelligence testing programs adapted 
to various types of school organization. At times, aptitude tests in specific 
fields, rating scales, check lists, personal interviews, and the like will also 
be required. The particular combination of measuring techniques required 
in any given situation will depend upon the specific purposes to be served. 
As a rule, classroom teachers will find a larger place for nonstandardized, 
teacher-made tests in the solution of instructional problems than will 
school administrators in the solution of administrative problems. The 
reverse condition will tend to be true for standardized tests. Table 27 is a 
sort of “balance sheet” which briefly summarizes some of the chief 
advantages and limitations of various types of achievement tests. It is evident 
that there is a legitimate place for all kinds of tests, but no one test is 
equally good for all purposes. 


7 State-wide surveys in Massachusetts and New Jersey indicate this clearly. See Test 
Service Bulletins No. 38 and No. 42, World Book Company. 
8 Nearly every major test publisher has such a battery. The interested administrator 
or other person responsible for helping with the planning of testing programs will want 
to have test catalogues of the companies listed with asterisks in Appendix F, page 464. 


TABLE 27 


ADVANTAGES AND LIMITATIONS OF STANDARDIZED AND NONSTANDARDIZED TESTS OF ACHIEVEMENT 


1. VALIDITY 

a. Curricular 
   Standardized 
      Advantages: Careful selection by competent persons after experimentation. 
         Fit typical situations. 
      Limitations: Inflexible. Too general in scope to meet local requirements 
         fully, especially in unusual situations. 
   Nonstandardized, essay 
      Advantages: Useful for English, advanced classes; afford language training. 
         May encourage sound study habits. 
      Limitations: Limited sampling. Bluffing is possible. Mix language factor 
         in all scores. 
   Nonstandardized, objective 
      Advantages: Extensive sampling of subject matter. Flexible in use. 
         Discourage bluffing. Easier to prevent and to detect cheating. 
      Limitations: Narrow sampling of functions tested. Negative learning 
         possible. Piecemeal study encouraged. 

b. Statistical 
   Standardized 
      Advantages: With best tests, high. 
      Limitations: Criteria often defective. Size of coefficients largely 
         dependent upon range of ability in group tested. 
   Nonstandardized, essay 
      Advantages: Inexperienced teachers may do better than with objective types. 
      Limitations: Usually not known. 
   Nonstandardized, objective 
      Advantages: Compares favorably with standard tests. 
      Limitations: Adequate criteria usually lacking. 

2. RELIABILITY 
   Standardized 
      Advantages: With best tests, very high; usually above .90, often 
         above .95. Usually fully objective. 
      Limitations: No guarantee of validity. Depends upon range of ability in 
         group tested. 
   Nonstandardized, essay 
      Advantages: (none listed) 
      Limitations: Average is low. Subjective scoring. 
   Nonstandardized, objective 
      Advantages: Sometimes approaches that of standard tests. Objective scoring. 
      Limitations: No guarantee of validity. 

3. USABILITY 

a. Ease of Administration 
   Standardized 
      Advantages: Definite procedure, time limits, etc. Economy of time. 
      Limitations: Manuals require study and are sometimes inadequate. 
   Nonstandardized, essay 
      Advantages: Easy to prepare. Easy to give. 
      Limitations: Lack of uniformity. 
   Nonstandardized, objective 
      Advantages: Directions rather uniform. Economy of time. 
      Limitations: Difficult to prepare. 

b. Ease of Scoring 
   Standardized 
      Advantages: Definite rules, keys, etc. Largely routine. 
      Limitations: May take considerable time. Monotonous. 
   Nonstandardized, essay 
      Advantages: (none listed) 
      Limitations: Slow, uncertain, and subjective. 
   Nonstandardized, objective 
      Advantages: Definite rules, keys, etc. Largely routine. 
      Limitations: May take considerable time. Monotonous. 

c. Ease of Interpretation 
   Standardized 
      Advantages: Better tests have adequate norms. Useful basis of comparison. 
         Equivalent forms. 
      Limitations: Norms often confused with standards. Some norms defective. 
         Norms for various types of schools and levels of ability are often lacking. 
   Nonstandardized, essay 
      Advantages: (none listed) 
      Limitations: No norms. Meaning doubtful. 
   Nonstandardized, objective 
      Advantages: Local norms can be derived. 
      Limitations: No norms available at beginning. 

SUMMARY: MAIN POINTS, PRO AND CON 
   Standardized 
      Advantages: Convenience, comparability, objectivity. Equivalent forms may 
         be available. 
      Limitations: Inflexibility. 
   Nonstandardized, essay 
      Advantages: Useful for part of many tests and in a few special fields. 
      Limitations: Limited sampling. Subjective scoring. Time consuming. 
   Nonstandardized, objective 
      Advantages: Extensive sampling. Objective scoring. Flexibility. 
      Limitations: Preparation requires skill and time. 




What is the best procedure? Regardless of the purpose of the testing 
program or who makes the selection of tests, it is important that a 
systematic, businesslike procedure be employed. Users of standard tests will find 
the information contained in The Mental Measurements Yearbooks of great 
value. The comprehensive character of the tests reviewed in this 
publication is indicated by the “Classification of Tests” in The Fourth Mental 
Measurements Yearbook, which is shown in Table 28.⁹ 


TABLE 28 

CLASSIFICATION OF TESTS IN The Fourth Mental Measurements Yearbook (1953) 

[Only part of this table is legible; the surviving entries are listed in order.] 

                                                  TEST 
ACHIEVEMENT BATTERIES ...........................    1 
CHARACTER AND PERSONALITY 
   NONPROJECTIVE 
   PROJECTIVE 
ENGLISH 
   COMPOSITION 
   LITERATURE 
   SPELLING 

   GREEK 
   ITALIAN 
MATHEMATICS 
   ALGEBRA 
MISCELLANEOUS 
   AGRICULTURE ..................................  441 
   BUSINESS EDUCATION ...........................  443 
   COMPUTATIONAL AND SCORING 
   ETIQUETTE 
   HANDWRITING 
   RECORD AND REPORT FORMS 
   RELIGIOUS EDUCATION 
   SAFETY EDUCATION .............................  521 
   TESTING PROGRAMS .............................  526 

   SPECIAL FIELDS 
   STUDY SKILLS 
SOCIAL STUDIES 
   ECONOMICS 

   NURSING 
   SALESMEN 

As an illustration of the type of evaluations in this volume, the following 
excerpts from comments on the Primary Mental Abilities tests are given.¹⁰ 

Anne Anastasi, Professor of Psychology at Fordham University, criticizes 
the PMA tests on the basis of the methods used to estimate their 
reliability: 

A special weakness of the entire PMA series is the treatment of test reliability. 
In tests such as these, designed for intra-individual comparisons and profile analysis, 
the need for proper determination and reporting of reliability is particularly urgent. 
Yet in the various forms of the PMA tests, reliability coefficients are either in- 
adequately reported, incorrectly computed, or completely omitted. Odd-even and 
Kuder-Richardson techniques have been repeatedly employed in finding the reli- 
ability of speeded tests [for which they are not suitable]. In several forms, no 
recognition is given to this problem at all, spurious and meaningless reliabilities 
as high as .98 being reported without comment, except to say that the reliabilities 
would probably be higher in more heterogeneous samples. 


Ralph F. Berdie, Professor of Psychology and Director of the Student 
Counseling Bureau at the University of Minnesota, does not deliver a 
favorable final verdict: 


In general, one would expect these tests to be a great contribution to education 
and guidance. That they have not been may be due either to the test itself or to 
the inadequate follow-up work that the authors or others have done. It may be 
that in attempting to produce a test that requires relatively little time or money, 
the publishers have sacrificed those very things that made the tests potentially 
valuable. It is too bad that after such tests have been available for more than 
14 years, one must still conclude that their principal uses are experimental. 


John B. Carroll, Associate Professor of Education at Harvard University, 
is more complimentary: 

The authors are undoubtedly on solid ground in their discussions of the Verbal- 
Meaning and Reasoning factors. In all probability, the statements which the 
authors make about the Number and the Space factors and their relevance to 
certain types of curricula and jobs will eventually be substantiated in validity 
studies, but this is only the reviewer’s hunch. 


Stuart A. Courtis, Professor Emeritus of Education at the University of 
Michigan, starts off flatteringly but ends by questioning the value of all 
tests: 


No tests this reviewer has ever seen or used approach the PMA tests in the care 
and ingenuity evident in their construction. The authors have very wisely broken 
away from conventional memory question-response type of items. In all tests the 
exercises involve mental functioning in action. . . . In other words, the tests and 
manuals might well serve as models for all publishers of tests to follow. . . . This 
reviewer, however, rejects as inappropriate or totally false both the statistical 
methods and the theories of primary mental abilities derived from their use. . . . This 
reviewer predicts that the PMA tests, in spite of their structural and functional 
excellence, will not yield laws or educational principles any more than other tests 
have done. 


9 Oscar K. Buros (Editor), The Fourth Mental Measurements Yearbook, page vii. 
Highland Park, New Jersey: The Gryphon Press, 1953. 
10 Ibid., pages 698-710. 


TABLE 29 


OTIS SCALE FOR RATING STANDARD TESTS¹¹ 


Scale for Rating Tests                         Names of Tests 

Manual (5) 
Validity (15) 
Reliability (10) 
Reputation (5) 
Ease of Administration (Total 15) 
   (a) Preparation (4) 
   (b) Time limits (4) 
   (c) Explanation needed (3) 
   (d) Alternative forms (4) 
Ease of Scoring (Total 15) 
   (a) Objectivity (10) 
   (b) Time required (3) 
   (c) Simplicity (2) 
Ease of Interpretation (Total 15) 
   (a) Norms (5) 
   (b) Directions for interpreting (4) 
   (c) Class record (1) 
   (d) Application of results (5) 
Convenient Packages (5) 
Typography and Makeup (5) 
Test Service (10) 
Total (100) 


11 Published by World Book Company, Yonkers, New York. 




P. E. Vernon, Professor of Educational Psychology in the Institute of 
Education of the University of London, seems favorably impressed. He 
points out that: 


Thurstone has clearly retreated from his earlier opposition to “general” 
intelligence. He not only allows total scores to be calculated for each battery, but even 
provides for their conversion to IQ’s. 


Thus there are reviews of the PMA series by five different persons which 
fill ten large double-column pages and together cover almost all points of 
interest to nearly any prospective user of the tests. Few of the other tests 
get this much coverage; many are reviewed by only a single person. 

The Fourth Yearbook also cites nine reviews of PMA tests in previous 
yearbooks. 

In the choice of standard tests it is always wise to have available for 
careful examination both the test blanks and the test manuals of all tests 
being considered. Most county and city school systems will find it desirable 
to maintain for such purposes up-to-date sample (“specimen”) sets of 
the more important tests published. To assist in making the necessary ex- 
aminations and comparisons, the use of a rating scale will be found helpful. 
The first one published, prepared by Otis, is reproduced in Table 29. 
A more analytical scale for evaluating achievement tests is that of Cole 
and von Borgersrode, given in Table 30. 
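
The bookkeeping behind such a scale is simple enough to sketch in a few lines of Python. The maximum points below are those of the Otis scale in Table 29; the function name and the ratings for "Test A" are hypothetical and serve only as an illustration, not as a recommended procedure.

```python
# Maximum points per heading, copied from the Otis scale (Table 29).
OTIS_MAX_POINTS = {
    "Manual": 5,
    "Validity": 15,
    "Reliability": 10,
    "Reputation": 5,
    "Ease of Administration": 15,
    "Ease of Scoring": 15,
    "Ease of Interpretation": 15,
    "Convenient Packages": 5,
    "Typography and Makeup": 5,
    "Test Service": 10,
}


def total_rating(points_awarded):
    """Sum a rater's points, rejecting any entry outside the scale's range."""
    total = 0
    for heading, maximum in OTIS_MAX_POINTS.items():
        awarded = points_awarded.get(heading, 0)
        if not 0 <= awarded <= maximum:
            raise ValueError(f"{heading}: {awarded} is outside 0..{maximum}")
        total += awarded
    return total


# Hypothetical ratings for an invented "Test A".
test_a = {
    "Manual": 4,
    "Validity": 12,
    "Reliability": 8,
    "Reputation": 3,
    "Ease of Administration": 12,
    "Ease of Scoring": 13,
    "Ease of Interpretation": 10,
    "Convenient Packages": 4,
    "Typography and Makeup": 4,
    "Test Service": 7,
}

print(sum(OTIS_MAX_POINTS.values()))  # the scale's maximum
print(total_rating(test_a))           # points awarded to Test A
```

Keeping the maximum for each heading in one place makes it easy to try other weightings without touching the arithmetic.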

The use of these scales not only directs attention to significant points 
but also gives some idea of the relative weight of the various items. In the 
authors’ opinion, Cole and von Borgersrode assign too much weight to 
reliability, and both scales assign too little weight to validity, the most 
important quality of any measuring instrument. Also, the relative weight 
assigned by both these scales to what may be termed usability seems some- 
what heavy. The authors suggest a slight revision in weights and a re- 
grouping of sections IV, V, VI, and VII under the heading usability. The 
major divisions and subdivisions, with revised weightings, would then be 
as follows: 

Division                                        Points
I.   Preliminary Information
II.  Validity .................................. 50
     A. Curricular ............................. 30
     B. Statistical
III. Reliability
     A. More important points
     B. Less important points
IV.  Usability
     A. Ease of administration ................. 10
     B. Ease of scoring ........................  5
     C. Ease of interpretation ................. 10
     D. Miscellaneous ..........................  5


TABLE 30

Cole-von Borgersrode Scale for Rating Standardized Tests12

I. Preliminary Information
    1. Exact name of test.
    2. Name and position of author.
    3. Name of publisher and nearest address.
    4. Cost.
    5. Date of copyright.
    6. Purpose of test.

II. Validity (25)
    A. Curricular (15)
        1. Exact field or range of educational functions which test measures?
        2. Ages and grades for which intended?
        3. Criteria with which material was correlated?
        4. Do questions parallel good teaching procedures?
        5. How wide is sampling of important topics?
        6. What is the social utility of questions?
        7. Is test claimed to be diagnostic? (If so, see VI, 5, c, below.)
    B. Statistical (10)
        1. Correlated against what outside criteria?
        2. Size of coefficient of correlation?
        3. Size and representativeness of sampling?
        4. Proof of adequacy of items (such as statements as to experimental tryout of
           items individually to determine that no large percentage is failed or passed
           by all pupils and that the items show a consistent increase of percentages
           of successes with successive age or grade levels).

III. Reliability (25)
    A. Most important points.
        1. Correlated with what?
        2. Size and representativeness of sampling?
        3. Reliability coefficients.
        4. The means of the distributions.
        5. The standard deviations of the distributions.
        6. Other similar statistics.
        7. Intercorrelations.
    B. Less important but desirable.
        1. Order of giving various forms of test.
        2. Is test reliable enough statistically for individual measurement, or should it
           be used only for groups?
        3. Evenness of scaling (see II, B, 4).
        4. Are pupils accustomed to this type of test?

IV. Ease of Administration (15)
    1. Manual of Directions (3)
        a. How complete and simple is the manual?
        b. Does manual control test conditions well?
        c. Typographic makeup.


12 Robert D. Cole and Fred von Borgersrode, “A Scale for Rating Standardized
Tests,” School of Education Record of the University of North Dakota, 14: 11-15, October,
1928.


TABLE 30 (Continued)

Cole-von Borgersrode Scale for Rating Standardized Tests

IV. Ease of Administration (15) (Cont.)
    2. Simplicity of administration (9)
        a. Amount of explanation needed for pupils by examiner?
        b. Are directions to pupils clear, detailed, comprehensive?
        c. Is arrangement of test convenient for pupils?
        d. Are samples and “fore-exercises” given when needed?
        e. Time needed for giving?
    3. Alternate forms (3)
        a. Number.
        b. Evidence of reliability.
        c. Evidence of equivalency.

V. Ease of Scoring (10)
    1. Degree of objectivity: purely objective or some judgment on part of examiner?
    2. Are adequate directions given: clear, equal to all emergencies?
    3. Is scoring key adjusted to size of test?
    4. Time needed to score one test.
    5. Simplicity of procedure.
        a. Number of processes needed to get final score?

VI. Ease of Interpretation (20)
    1. Norms (6)
        a. Kind: age, grade, percentile, standard score, etc.
        b. Derivation: size and representativeness of sampling.
        c. Tentative, arbitrary, or experimental?
        d. For separate parts?
        e. How expressed?
    2. Is class record provided? (1)
    3. Are there provisions for graphing results? (1)
    4. Is interpretation easy or hard? (2)
    5. Application of results (10)
        a. Are directions or [illegible] given for application of results to benefit
           teaching or administration?
        b. Are tests survey or diagnostic?
        c. If diagnostic:
            (1) Proof of diagnostic value?
            (2) What [illegible] or principles underlie construction?
            (3) How many different skills, abilities, or aspects of the subject are analyzed?
            (4) Does analysis of total subjects into unit abilities follow teaching
                practices?
            (5) Is the diagnosis individual or class? Proof?
            (6) Does the test demand tabulations of individual pupils’ errors to secure
                a diagnosis?
            (7) Is a remedial program provided or suggested?

VII. Miscellaneous (5)
    1. Typography and makeup.
        a. Arrangement of printed matter.
        b. Legibility of type.
        c. Quality of paper.
        d. Are test blanks free from distractions, norms, directions to examiner, etc.?


TABLE 30 (Continued)

Cole-von Borgersrode Scale for Rating Standardized Tests

VII. Miscellaneous (5) (Cont.)
    2. Is the time required for giving as small as is consistent with reliable
       measurement?
    3. Is the cost in keeping with the amount, scope, and reliability of the results
       yielded?
    4. Is good test service provided by the publisher?
    5. Kind of objective questions used?


A desirable procedure is to have a group of at least three competent
people, each independent of the others, look over all the tests being considered,
the manuals accompanying them, and any evaluations available.
Each judge first compares the tests with respect to validity, and records 
the judgment in points before considering anything else. Then he goes on 
to reliability and makes a similar judgment on each test. Finally, he does 
the same for usability. This method will tend to produce greater agreement 
among the judges regarding the relative ranks of the tests on the criteria 
individually. After all, the total point score allowed a test is less important 
than the rating on the divisions separately. 
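
The comparison of judges division by division can be sketched as follows. This is only an illustration under stated assumptions: the judges, tests, and point values are invented, and the helper names `ranks` and `judges_agree` are hypothetical rather than part of any published procedure.

```python
def ranks(points_by_test):
    """Rank tests from 1 (most points) downward; tied tests share the better rank."""
    ordered = sorted(points_by_test.values(), reverse=True)
    return {test: ordered.index(points) + 1 for test, points in points_by_test.items()}


# Hypothetical points from three independent judges, recorded one division at a time.
judgments = {
    "validity": {
        "Judge 1": {"Test A": 40, "Test B": 32, "Test C": 25},
        "Judge 2": {"Test A": 38, "Test B": 35, "Test C": 20},
        "Judge 3": {"Test A": 42, "Test B": 30, "Test C": 28},
    },
    "reliability": {
        "Judge 1": {"Test A": 12, "Test B": 16, "Test C": 15},
        "Judge 2": {"Test A": 10, "Test B": 17, "Test C": 14},
        "Judge 3": {"Test A": 13, "Test B": 15, "Test C": 16},
    },
}


def judges_agree(division):
    """True when every judge ranks the tests identically on this division."""
    rankings = [ranks(points) for points in judgments[division].values()]
    return all(ranking == rankings[0] for ranking in rankings)


for division in judgments:
    print(division, "agree" if judges_agree(division) else "differ")
```

Because the ranks are computed within each division, a disagreement on reliability is visible even when the judges' total points happen to come out close together.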

Emphasizing the close relationship between teaching and testing, Brownell
suggests the following criteria13 for evaluating tests:


1. Does the test elicit from the pupils the desired types of mental processes? 

2. Does the test enable the teacher to observe and analyze the thought processes 
which lie back of the pupils’ answers? 

3. Does the test encourage the development of desirable study habits? 

4. Does the test lead to improved instructional practice? 

5. Does the test foster wholesome relationships between teacher and pupils? 


In selecting a test for a given purpose, the grade level on which it is to 
be used must be given consideration. Test publishers often suggest a con- 
siderable grade range in which the test may be used. But both test authors 
and publishers tend to be too optimistic concerning the range of usefulness 
of their tests. For example, an intelligence test that is supposed to be suit- 
able for grades three to eight may be found to be too difficult for the third 
grade and too easy for the eighth. The reader will doubtless recall from a 
discussion in Chapter 4 that it has usually been found that a test has 
optimum discrimination for a group whose average corrected-for-chance 
score is approximately 50 per cent of the maximum score possible on the
test. It must be remembered, however, that the discriminating function of 
diagnostic and certain other specific tests is usually relatively unimportant. 
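
A rough check of a test's suitability for a given grade can be sketched as below. The sketch assumes the conventional correction for chance, S = R - W/(k - 1) for items with k options (which reduces to R - W for two-option items); the passage above does not restate the formula, so that choice, the function names, and the four sample papers are all assumptions made for illustration.

```python
def corrected_score(right, wrong, options):
    """Conventional correction for chance: rights minus wrongs / (options - 1)."""
    return right - wrong / (options - 1)


def mean_percent_of_maximum(papers, options, n_items):
    """Group mean corrected score, expressed as a per cent of the maximum score."""
    corrected = [corrected_score(right, wrong, options) for right, wrong in papers]
    return 100 * (sum(corrected) / len(corrected)) / n_items


# Hypothetical (right, wrong) counts on a 40-item, five-option test.
papers = [(30, 8), (22, 12), (18, 16), (26, 12)]

percent = mean_percent_of_maximum(papers, options=5, n_items=40)
print(f"{percent:.1f} per cent of maximum")
```

A figure near 50 suggests the test is about right in difficulty for the group; one far above or below suggests it is too easy or too hard for that grade.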


13 William A. Brownell, “Some Neglected Criteria for Evaluating Classroom Tests,”
National Elementary Principal, 16: 485-492, July, 1937.




C. Administering the Tests


The next step in the testing program is the administering of the tests. 
In order to insure that this is properly done, three questions must be answered:


1. When should the tests be administered? 
2. Who should administer the tests? 
3. What is the correct procedure to follow? 


Each of these questions deserves careful consideration.

When should the tests be administered? As problems concerning 
the use of intelligence tests differ somewhat from those concerning the use 
of achievement tests alone, it is better to consider the two separately. When 
should intelligence tests be administered? There is general agreement that 
it is not necessary to give the same pupils intelligence tests every year, but 
there is also agreement that possible fluctuations on group tests are great 
enough to warrant giving such tests more than once. The fluctuations are 
likely to be most serious in the primary grades.14 A reasonable plan employed
by many school systems is to give intelligence tests at transitional
points in the pupil’s school history. As Stoddard suggests: “Intelligence is
analogous to health; any estimate of it should be rechecked close to the
making of an important decision.”15 Procedure would therefore vary according
to the school organization. A suggested minimum program is as
follows: 


Type of Organization        Grades to Give Intelligence Tests
Six-six plan .............. First and sixth or seventh
Seven-five plan ........... First and seventh or eighth
Eight-four plan ........... First and eighth or ninth
Six-three-three plan ...... First, sixth or seventh, and ninth or tenth


If possible, it would be well to add to this minimum program a test at about 
the fourth grade and one at the end of the high-school course. 

There is some disagreement regarding the best time of year in which to 
give the intelligence tests. Of course, if the tests are to have maximum 
value, their results must be made available at the very beginning of these 
transitional periods. This means they should be given early in the first 
grade if the pupils have had no previous kindergarten experience. Since 
Updegraff16 found that for preschool children the reliability of the test is


14 Cf. Mildred M. Allen, “Relationship between the Indices of Intelligence Derived
from the Kuhlmann-Anderson Intelligence Tests for Grade I and the Same Tests for
Grade IV,” Journal of Educational Psychology, 36: 252-256, April, 1945.

15 George D. Stoddard, The Meaning of Intelligence, page 94. New York: The Macmillan
Company, 1943.

16 Ruth Updegraff, “The Determination of a Reliable Intelligence Quotient for the
Young Child,” Pedagogical Seminary and Journal of Genetic Psychology, 41: 152-166,
September, 1932.


increased by postponing testing until two weeks after entrance to school, 
it may be well to avoid giving the test till the second or third week of school 
in the lower grades. The later tests can be given either at the beginning of 
the transitional year or at the close of the year preceding. There is a tend- 
ency to have tests for college entrance administered in the high schools 
near the close of the senior year. This is obviously necessary if such tests 
are to be used in counseling these seniors regarding the feasibility of con- 
tinuing their education. There will usually be a few pupils who will transfer 
into the system and who have not had intelligence tests, and others in the 
system about whom teachers may feel serious doubt regarding the validity 
of the existing record. 

The frequency with which achievement tests should be used will depend 
primarily upon the purpose they are to serve. Most purposes, however, 
will require at least two series of tests administered at intervals of a semester
or a year. Most achievement tests have norms for the middle and
the end of the year, but often for no other time. When tests are given at 
these periods, comparisons with norms are easiest. There is also the fact 
that many studies have shown a considerable decline in knowledge at the 
end of the summer vacation. This would seem to favor giving the tests 
at the end of the school year, when the pupils’ status is more normal.
A comparison between the records made by pupils at the end of each of two 
successive years is usually more trustworthy than that between the begin- 
ning and end of one year. 

There are some advantages in having the tests administered in the fall. 
Almost always some pupils will enter the school for the first time and their 
status can best be determined by administering tests to all the pupils. The 
teachers will then have the entire school year in which to remedy any defi- 
ciencies revealed. Fall testing also avoids the undesirable practice of cram- 
ming. If too much emphasis is placed on “improvement” shown during 
the year, however, pupils may be tempted not to do their best on the first 
series of tests. This would not be the case if progress is measured between 
two series of tests administered at the end of the preceding year and at the 
end of the current school year. 

This practice will also make it possible to have the information serve
several purposes. It can be used partially as a basis for determining pro- 
motion from the grade, for educational guidance, and possibly for section- 
ing the next grade. There seems also no good reason why an analysis of the 
errors revealed cannot serve equally well as a basis for remedial teaching 
in the succeeding grade as if the new teacher had given the test at the 
beginning of the year. Of course, in some instances there might be con- 
siderable value in repeating the test at the beginning of the year in order 
to determine the effects of the summer vacation, apart from the better- 
established weaknesses which were present when the vacation started. 




Moreover, the analysis of errors is more trustworthy when based upon two 
samplings of performance than upon one. 

Who should administer the test? Obviously, only competent persons 
should administer standardized tests. It is not always an easy matter to 
tell who is really competent, however. In the case of individual tests of the 
Stanford-Binet type, this requirement means that only persons who have 
had specific instruction in college classes should attempt to administer
them. There should be at least one person in every school who is qualified
to give such tests. When tests are used for purposes of research, or when
they are used to compare one grade, class, or school with others, they should
usually be given by one person, or a small group of specially trained examiners.
But in the ordinary testing program, employing group intelligence
tests and achievement tests, the regular classroom teachers should usually
administer the tests. Most of them will welcome an opportunity to do so.
At the present time there seems no good reason for selecting a test whose
administration is so difficult as to be beyond the mastery of average teachers
in the public schools. The point of view of McCall seems eminently
sound: 

Many years ago certain specialists sought to secure a monopoly of the privilege 
of using standard tests by trying to persuade educators to regard the tests as 
possessing certain mystic properties. A few of us with Promethean tendencies 
set about taking these sacred cows away from the gods and giving them to mortals. 
Can teachers be entrusted with tests? If not, then teachers ought not to be trusted 
with 90 per cent of their present functions. We now entrust them with the far more 
difficult task of teaching reading, creating concepts and building ideals. Let us not 
strain at a gnat when we have swallowed fifty elephants. 


But it is well not to take the competency of the examiners for granted. 
One of the best plans is to get the group of examiners together and demon- 
strate the administration of the tests to be used. One way to do this is to 
give a demonstration with a regular class and to follow this by a discussion 
with the examiners of the procedure they have seen. Another way is to 
administer the test to the examiners themselves. This should be followed 
by a full discussion of the procedure involved. It is usually well to suggest 
that after each examiner has studied the manual he try the procedure on
some other person, such as a member of the family; or two teachers may
try it out on each other. If questions then arise, they can be settled by a 
conference with the person in general charge of the program before the 
examiner goes before his group actually to administer the test. It has been 
found that, if such measures are taken, the regular classroom teachers can 
obtain practically the same results with group tests as can be obtained by 
special examiners. 


17 W. A. McCall, in The Test Newsletter, published by Bureau of Publications, Teachers
College, Columbia University, December, 1936.


What procedure should be followed? Although the procedure of 
administering group intelligence tests and achievement tests is not beyond 
the mastery of classroom teachers and school administrators, some difficulties
may arise. In fact Ligon18 argues that good group testing is more
difficult than individual testing. In the first place, the conditions for the 
test must be favorable. It is usually best to have the tests given in the 
familiar environment of the pupils’ own classrooms. Especially is this true 
of younger children. It is well always to have the tests given at regular 
class time without permitting them to run over into lunch hour or play 
time. For the same reason it is desirable not to have tests just before or 
just after an important event, such as a holiday, a school party, or an 
athletic contest. Precautions should be taken to avoid all unnecessary dis- 
tractions and interruptions during the progress of the test. It is a good plan 
to hang on the outside of the classroom door a card which reads: Tests 
Going On. Please Do Not Disturb. Pupils should be instructed to remove 
everything from the tops of their desks except two well-sharpened pencils 
and an eraser. The examiner should also have ready a few extra pencils 
in case of an emergency. All these things must be looked after in order to 
insure favorable working conditions for the test. 

As a rule, anyone can administer a group test successfully who meets 
three requirements. The first of these is the ability to read well. Good 
silent reading is required for the mastery of the directions printed in the 
manual which accompanies the test. Good oral reading ability is required, 
for the directions to the pupils should be read, not recited from memory. 
To undertake to give the test from memory is to run a serious risk of leaving 
out some important word or phrase or of paraphrasing the directions in 
such a way as to change their meaning. But the examiner should be so 
familiar with the manual that he can read the directions with his eyes off 
the page a good part of the time. The directions should be read with proper 
emphasis in a clear voice just loud enough to be heard throughout the room. 
The aim should be to make the meaning understood without arousing 
anxiety or excitement. 

The second requirement for administering a test is the ability to keep 
time accurately. If the test has a single time limit of, say, twenty minutes 
or more, it is probably preferable to time it with an ordinary pocket watch 
rather than a stop watch, since the latter may, upon occasion, be quite 
erratic. When a pocket watch is used, set its hands to some convenient 
time such as the beginning of an hour and give the starting signal just as 
the second hand reaches 60 (which is also 0). It will usually help students 
and examiner alike to have a clock in the room which shows everyone the 
correct time. Some testers use a special device known as an interval timer. 

The aim should be to keep the time to a second. On most tests the signal 


18 Ernest M. Ligon, “The Administration of Group Tests," Educational and Psycho- 
logical Measurement, 2: 387-399, October, 1942. 


to start is, “Ready, go!” or “Ready, begin!” When this signal is given,
the examiner should note the exact time: hour, minute, and second. This
should be recorded immediately, preferably on a small card or specially prepared
blank. The record for Test 1 would look like this:


Test 1                      Hr.   Min.   Sec.
Time test began ............ 9     0      0
Time allowed ...............       5
Time to stop ............... 9     5      0


Experienced examiners know that it is never safe to trust one's memory 
to keep the time. A written record must be made. 
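
The arithmetic of the timing record can be checked mechanically. The short Python sketch below reproduces the record for Test 1 (began at 9:00:00, five minutes allowed); the function name `stop_time` and the "HH:MM:SS" notation are assumptions chosen for the illustration.

```python
from datetime import datetime, timedelta


def stop_time(began, minutes_allowed, seconds_allowed=0):
    """Add the time allowed to the recorded starting time ("HH:MM:SS")."""
    start = datetime.strptime(began, "%H:%M:%S")
    stop = start + timedelta(minutes=minutes_allowed, seconds=seconds_allowed)
    return stop.strftime("%H:%M:%S")


# The record for Test 1: began at 9:00:00, five minutes allowed.
print(stop_time("09:00:00", 5))
```

The same function handles a test that does not begin on the hour, which is where mental arithmetic most often goes astray.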

The third requirement for administering a test is the ability to follow 
directions accurately. The manual should be followed verbatim. No deviation
whatsoever is permissible. To add anything to, or to modify the directions
in any way, means that it is no longer a standardized test. Boynton19
gives some interesting illustrations of unconscious clues given by inexperienced
examiners using the Stanford-Binet. One examiner, for example,
when asking the meaning of the word “tap” in the vocabulary test began
to tap on the table, and when he came to the word “eyelash” he looked the
child straight in the eye and batted his eyes rapidly. The norms are made
on the assumption that a prescribed formula is to be used. As a part of the
preliminary instructions pupils are almost always told not to ask any questions
after the test starts. Occasionally a pupil forgets this instruction and
holds up his hand for a question. The examiner should walk over to him
and, if it is a reading test or an intelligence test whose purpose is following
directions, should say in a quiet voice, “Read it carefully and do just what
it says.” If it is an ordinary achievement test and the pupil is concerned
about where to put his answer or some other point of mechanics that does
not involve the answer to a question in the test or modify the directions
already given, it is permissible to set the pupil at ease without causing
disturbance. Kelley suggests this principle in handling the child who is in
trouble: “The examiner should be free to say or do anything that does not
disturb or delay pupils at work, that does not help the individual child in
the thing in which he is being tested, and that does set him to work again
after some foolish or trivial issue has troubled him.”20 Examples of permissible
statements are: “Yes, you may change your response if you decide
it is wrong,” “Just work on the side of the sheet; you do not need scratch
paper,” “When you have finished the first column go right on to the next
one,” “No, you must not go back to a test you have passed,” and the like.
But if the pupil asks the meaning or spelling of a word, or how to answer


19 Paul L. Boynton, Intelligence, Its Manifestation and Measurement, pages 276-277.
New York: D. Appleton-Century Company, 1933.

20 Truman Lee Kelley, Interpretation of Educational Measurements, page 46. Yonkers:
World Book Company, 1927.


a test item, the examiner should say quietly: “I cannot tell you. Go on to
the next one.” In case of doubt, the examiner should err on the side of saying
nothing. While the test is in progress the examiner must be alert constantly
to see that the pupils neither help nor hinder each other nor are distracted
by external factors. Ligon indicates the following requirements of good
group testing: “That all the subjects understand the instructions, that they
all work throughout the assigned time at their optimum level of achievement,
that they are in no way helped, hindered, or distracted by one another,
that they do not quit trying or omit any section of the test, that
examiners give instructions adequately and in a stimulating, effective tone
of voice, not a dull bored monotone, and that proctors are observing
every movement of the group, stimulating lagging souls, inhibiting wandering
eyes, and detecting failure to follow instructions.”21 A test is more than
a measuring device; it presents a standardized situation in which to observe
pupil behavior. Any occurrence observed during the progress of the test
that may throw light upon the interpretation of the results should be carefully
recorded.


D. Scoring the Tests 


It is desirable to have the tests scored as quickly as possible and with
the highest possible degree of accuracy. As a rule, then, that method is best
which accomplishes this with the least expenditure of money, time, and
energy. There are two questions to consider:


1. Who should score the tests? 
2. What technique should be used? 


Who should score the tests? In actual practice, standard tests are
scored by a variety of persons. Sometimes, especially in larger systems, the
work is done by a clerical staff at a central bureau, or by the use of scoring
machines; the scoring may be contracted for with some outside agency,
such as the test publisher; sometimes it is done by advanced students under
supervision; at other times the scoring is done by administrative officials;
but the most common method seems to be to have the work done by the
regular teachers. Except in the larger systems where there is a bureau of
research equipped with special facilities, the scoring is probably best done
by the classroom teachers. In that way not only can the work be done
promptly, but the teachers can probably learn something of value about
the types of errors made on the achievement tests. But it is important to
get the scoring done without producing an unfavorable attitude toward it
on the part of the teachers. Some schools have found it very satisfactory
to dismiss classes at noon when the testing is in progress, so that the
teachers can devote the afternoon to the work of scoring. This would seem
an effective way of emphasizing the important fact that teaching and testing
are processes that are intimately related.


21 Ernest M. Ligon, op. cit., page 387.

What techniques should be used? Every reasonable precaution 
should be taken to assure a high degree of accuracy in scoring. It must not 
be assumed that merely because the directions are clear, the key complete, 
the separate answer sheets well designed, and the process entirely objective, 
perfect protection against errors is thereby afforded. Numerous studies 
give abundant evidence to contradict this assumption. They reveal two 
distinct types of errors in scoring, constant errors and variable errors. A com- 
mon example of the former type is misunderstanding the scoring directions; 
for instance, by counting omissions the same as errors, when using the 
scoring or correction formula. Such errors are especially serious, because 
there is no possibility of their offsetting each other according to any so-called
“law of averages.” Variable errors, on the other hand, sometimes
tend to make the score too high and at other times too low. While such 
errors may do serious harm to individual pupils, they tend to cancel each 
other in group measures such as averages. Examples of variable errors are 
errors resulting from carelessness, errors in counting the scores, errors in 
entering the scores on the front of the test booklet or on the record sheet, 
and errors in adding up the total score. Some of the most serious errors 
found are not in marking the paper at all but in counting and in addition. 
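
The difference between the two types of error can be made concrete with a small sketch. It assumes a two-response test scored by Score = R - W, with each item marked + (correct), - (incorrect), or 0 (omitted); the three marked papers and both function names are invented for the illustration.

```python
def formula_score(marks):
    """Score = R - W for a two-response test; omitted items count as neither."""
    return marks.count("+") - marks.count("-")


def misread_formula_score(marks):
    """The constant error described above: omissions are counted as errors."""
    right = marks.count("+")
    wrong = marks.count("-") + marks.count("0")
    return right - wrong


# Hypothetical marked papers: "+" correct, "-" incorrect, "0" omitted.
papers = ["++-+0+0++-", "+++-+++0++", "+-+-00++++"]

correct = [formula_score(p) for p in papers]
misread = [misread_formula_score(p) for p in papers]
print(correct)
print(misread)

# Every misread score is too low by the number of omissions, so the error
# survives in the class average instead of cancelling out.
print(sum(correct) / len(correct), sum(misread) / len(misread))
```

A variable error, by contrast, would push some of these scores up and others down, so that its effect on the average tends toward zero while individual pupils may still be seriously misjudged.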

Clearly, then, accuracy in scoring cannot be taken for granted. What is 
to be done about it? The first thing is to prevent the occurrence of errors 
whenever possible. The scorers must be taught how to score the papers and 
not merely told how to do it. They should be given an opportunity to study 
the manual and the scoring keys. Whenever possible, an actual demonstra- 
tion of seoring should follow. Tt is a good idea, also, to check carefully the 
first few papers marked by beginners to detect errors at the outset. This 
procedure should reveal any constant errors and the principal types of 
variable errors. It is always desirable to have each page or part of the test 
scored through all the papers in a set before going on to the second page 
or part of the test. If the scorers work in groups, as is usually desirable, 
each one can specialize in marking one part of the test, and pass the test 
when scored to the next scorer, who is specializing in marking the next part 
of the test. This procedure will reduce the risk of error and at the same time 
will increase the speed of scoring. It is usually an especially poor technique 


to have one person read the answers while the scorers mark the papers. 


This is slow, because the slowest scorer sets the pace. It also increases the 
risk of error, owing to the possibilities of losing the place or of failure to 
hear correctly. Colored pencils are desirable. Inexperienced scorers should 
mark each item in the test being scored in some uniform manner, such as 

d 0 for omitted items. Experienced scorers 


+ for correct, — for incorrect, an : 1 ‹ 
will save time by marking only the incorrect and omitted items. It is, of 


fall — drop 


SAMPLES 
north — south 


expel — retain 
comfort — console. ........ 
waste — conserve 
monotony — уапеу..,.........,.. 
quell — subdue ...... ЕО ; 


major — minor 
boldness — audacity 
ТЕ рее М cies eee de shee 
prohibit — allow .......... 
debase — degrade .......... 


COON NAD MAW DH н 


recline — stand ....,..... MORES RESET 
approve — veto . ...... Bee eR Con 
amateur — екреге,..,....:........ х 
evade — shun. ..... 


я ыыы 
Pw Ы м 


4 
wn 


= 
a 


tonic — stimulant . . 
incite — quell 
economy — frugality........... ., ; 
rash — prudent 


= 
3 


ГӘ 
oo 


ә = 
Oo 


obtuse — acute ........ SIRE TUE Sae 
transient — регтапепе............. 
expel — eject ....... а aie 
hoax — deception 
docile — submissive .....,......... 


NANN 
о ә м 


Sb 
an > 


N 
е 


мах — wane ... 
incite — instigate .: 
reverence — veneration ........ 
asset — liability .. 
appease — placate 


DIDI 


niearss et tm 


L4 
XQ 


S 
oo 


IDEE 


is) 
o 


а. 169.3 seh 


t 
o 


Figure 6. 


TEST 3. WORD MEANING 


When two words mean the SAME, draw a line under “SAME.” 
When they mean the OPPOSITE, draw a line under “OPPOSITE.” 


Right. 15... Wrong 


THE TESTING PROGRAM 


same — opposite 


same — opposite 
same — opposite 


same — opposite 
same — opposite 
same — opposite 
same — opposite 


same — opposite 
ped s 

same — opposite 

same — opposite 

beer. 

same — opposite 
3 : 

ame — opposite 


same — opposite 
same — opposite 
pal leat 
same — opposite 
same — opposite 
Ат H 
Same — opposite 


same — opposite 
same — opposite 
same — opposite 
same — opposite 
same — opposite 


same — opposite 
same — opposite 
same — opposite 
same — opposite 
same — opposite 


same — opposite 
same — opposite 
same — opposite 
same — opposite 
same — opposite 


1. B.. Score. Л.О... 


+++++ 


(+14++ tolti 


+Ootoo +++ о 


An Illustration of the Procedure Followed in Scoring Test 3 of the Terman 


Group Test of Mental Ability, Form A. (Copyright by World Book Company.) 


STEPS IN THE TESTING PROGRAM 233 


[Figure 7. A Sample Standard Test Scoring Record. The form provides blanks for:
Name of Test; School; No.; Scored by; Checked by; Scores added by; Errors;
Comment; Norms, etc., by; Class record by; Table by; Median by; Graph by.]


course, unnecessary to mark the items below the last one the pupil attempts.
But it is well to draw a horizontal line across the test under the
last item attempted. Figure 6 illustrates the scoring of an
alternative-response test of word meaning, using the formula Score = R - W.

The writers have found that the simple device of keeping a written record
of who marks, checks, transcribes, or totals each part of the test reduces
the likelihood of error. If the scoring is organized systematically, it is a
simple matter to keep such a record on a mimeographed sheet attached to
each package of tests when scored, as shown in Figure 7.
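
The scoring formula Score = R - W for a two-response test, mentioned above, can be illustrated with a short sketch. The function name, the sample key, and the sample responses are hypothetical and are not taken from the Terman test; omitted items are assumed to count as neither right nor wrong.

```python
def rights_minus_wrongs(responses, key):
    """Score a two-response test with Score = R - W.

    An omitted item (None) is assumed to count as neither right nor wrong.
    """
    right = sum(1 for r, k in zip(responses, key) if r is not None and r == k)
    wrong = sum(1 for r, k in zip(responses, key) if r is not None and r != k)
    return right - wrong


# Hypothetical five-item key and one pupil's answers (the fourth item is omitted).
key = ["same", "opposite", "opposite", "same", "opposite"]
responses = ["same", "opposite", "same", None, "opposite"]

score = rights_minus_wrongs(responses, key)  # 3 right, 1 wrong
print(score)
```

Subtracting the wrong answers is what makes this item type prone to scoring slips, since the scorer must make two counts and a subtraction rather than a single count.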

But in spite of these preventive measures, certain errors are likely to 
occur. The safest plan, therefore, is to have each set of papers marked 
a second time by different scorers, using pencils of a different color.
Dunlap²² found that the items most subject to errors in scoring are those of
the two-response type requiring a scoring formula and items requiring the
underlining of more than one word. If a complete rescoring does not seem
practical, a sampling method may be followed. Each fifth or tenth paper, for 
example, may be selected and carefully rescored, and if only an occasional 
minor error is found, the whole set may be safely accepted. On the other 
hand, if frequent or serious errors are found in these sample papers, the 
entire set should be rescored. In any event it is important to have some 
person other than the original scorer check the totals for each part of the
test and for the whole test, all substitutions in the scoring formulas, all
transcribing of scores, and all transmuting of point scores into derived
scores.²³ It is possible to locate many serious errors by examining closely
the profile of each individual pupil on all tests with this form of record. 
Any score much higher or much lower than the general level is suspicious. 
Also, when two or more tests are used which purport to measure the same 
function, any serious discrepancies should be scrutinized, on the supposi- 
tion that a high positive correlation is to be expected. The standard of 
absolute accuracy should be accepted by all scorers. The possibilities of 
serious injustice to individual pupils by errors in scoring should be fully 
recognized. 
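
The sampling method of rescoring described above might be sketched as follows. This is only an illustration: the function names are hypothetical, and the thresholds for "an occasional minor error" (at most one discrepancy in the sample, none larger than one point) are assumptions, since the text gives no numerical rule.

```python
def sample_for_rescoring(papers, every=5):
    """Select each fifth (or tenth, etc.) paper from the set for careful rescoring."""
    return papers[every - 1::every]


def rescoring_decision(original_scores, rescored_scores,
                       max_errors=1, minor_size=1):
    """Compare original and rescored totals for the sampled papers.

    Assumed rule: accept the whole set if there is at most `max_errors`
    discrepancy and none exceeds `minor_size` points; otherwise rescore all.
    """
    diffs = [abs(o - r) for o, r in zip(original_scores, rescored_scores)]
    serious = any(d > minor_size for d in diffs)
    error_count = sum(1 for d in diffs if d > 0)
    if serious or error_count > max_errors:
        return "rescore entire set"
    return "accept set"


papers = list(range(1, 26))                 # paper numbers 1 through 25
print(sample_for_rescoring(papers, every=5))
print(rescoring_decision([30, 42, 18, 25, 37], [30, 42, 19, 25, 37]))
print(rescoring_decision([30, 42, 18], [30, 36, 18]))
```

A school would set its own thresholds; the point of the sketch is only that the decision rule should be fixed before the sample is examined.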


E. Analyzing and Interpreting the Scores 


After the tests have been scored and checked, the next step is the analysis 
and interpretation of the results. Both processes go on together, for analy- 
sis is worthless without interpretation and interpretation is impossible with- 
out analysis. Analysis is of two main types, statistical and graphical. Before 
either can be undertaken, however, there is the important preliminary step 
of classification and tabulation. An analysis of errors appearing in the test 
papers is usually of major importance to the classroom teacher. Chapters 3, 
9, and 10 are concerned with a discussion of the whole problem of analysis 
and interpretation; only an outline will be given here to indicate the steps 
involved: 


1. Classification and tabulation of scores.
2. Statistical analysis of scores.
3. Graphical analysis and representation.
4. Use of norms and standards.
5. Analysis of errors.


22 Jack W. Dunlap, “The Relationship Between the Type of Question and Scoring
Errors,” Journal of Experimental Education, 6: 376-379, March, 1938.

23 Derived scores are obtained from tables of norms. Each point score is expressed in
some equivalent unit, such as an age or percentile score. The interpretation of these
units is considered in Chapter 10.


In a complete testing program all five of these steps will receive atten- 
tion, although not always to the same degree. If the primary purpose of the 
testing program is diagnosis, for example, the fourth step would be rela- 
tively unimportant and the fifth step relatively important. The reverse 
would be true of a program whose main objective is a study of the compara- 
tive efficiency of various grades, classes, and schools. 


F. Applying the Results 


The application of the results is the crux of the whole testing program. 
Everything that has gone before is really preliminary. Whatever value the 
tests are to have depends in the last analysis upon the use made of the 
results. 

Just what is to be done, of course, depends upon the purpose of the pro- 
gram. Later chapters will consider in some detail the procedure to be fol- 
lowed for several administrative and instructional problems. It will be 
sufficient at this point to give some idea of how the procedure will vary 
with the purpose. 

Suppose, for example, that the purpose of the tests is to determine the 
present status of a particular school with the idea of its improvement, and 
that the test data are before the principal. The question now is, what is 
to be done? Upon the basis of the test scores and other pertinent data, such 
as the teachers' estimates, health reports, age-grade status, and the like,
several pupils are given trial promotions to the next higher grades. A small
group of pupils, whose achievement and intelligence scores are well below
the central tendency of their respective grades, are organized into an
ungraded class and put in charge of a teacher whose outstanding virtues are
sympathy, patience, and common sense. Ability groups are also organized
in a few grades and classes, with appropriate differentiation in curricula
and methods.


Likewise, suppose the primary purpose of the testing program is to determine
whether or not the teaching emphasis is correct in the various subjects
in the grades, and, when the test results are in, it is apparent that most
of the grades are strong in arithmetic and spelling, about normal in reading,
and weak in language and the social studies. Now what is to be done here?
The principal calls the teachers together and presents the situation in tables
and graphs, with suitable comments by way of interpretation. Then follows
a regular “council of war.” One or more committees are appointed to make
a special study of the situation and to make recommendations at a meeting
to be held a little later. Eventually, after discussion and deliberation, a
course of action is decided upon, looking to the improvement of the situation
in the weaker subjects.




The procedure will again be somewhat different in essential respects if 
the primary purpose is diagnosis and remedial work in reading. Here the 
test results should be analyzed in some detail in each grade. An analysis 
of the test papers, item by item, is often very revealing. Special effort 
should be made to locate the specific nature of the reading difficulties. 
There may be found some general weaknesses, such as the inability to use 
the index and table of contents in a book, or possibly to locate the central 
idea in a paragraph. There are usually, in addition, other weaknesses, which 
appear in certain pupils and not in others. Some of these will not be revealed 
at all by the usual paper and pencil reading tests, but will require special 
tools and techniques. After considering these facts, the staff will try to plan 
a remedial program to be followed during the year. 

The essential point in all these cases is that something is done about the 
situation revealed by the test scores. To fail to apply the results in some prac- 
tical way is to fail in the testing program. 


G. Retesting to Determine the Success of the Program 


Most testing programs stop with applying the results, if, indeed, they 
go that far. But an essential step yet remains. After a reasonable time has
been allowed for a trial of the remedial measures which were agreed upon 
in the light of the test data, a checkup should be made to determine the 
success of this program. Most tests are not sufficiently accurate to reveal 
progress over a shorter period than one half year. As a rule, a second form 
of the test or tests used in the beginning should be employed in retesting. 
If this is not done, it will usually be very difficult to express the results 
in terms sufficiently comparable to make an accurate measure of progress 
possible. Of course, not all the gain found can be correctly attributed solely 
to the remedial program. Some of it is doubtless due to the practice effect 
or to familiarity with the test itself, part of it to teaching received outside 
the school, and part of it to natural growth. Often, however, the improve- 
ment will be so marked as to indicate beyond a reasonable doubt the effec- 
tiveness of the program attempted. At other times the improvement will be 
disappointingly small. It is then usually wise to modify the remedial pro- 
gram in the light of the results obtained. 

The essential point is that the success of the remedial program must not 
be taken for granted. On the contrary, a definite effort must be made to 
check upon its effectiveness. To fail to do this is to leave the testing pro- 
gram incomplete. There is no better reason for taking the efficiency of the 
remedial program on faith than there was for taking the earlier results 
of teaching on faith. 


H. Making Suitable Records and Reports 


Certain records and reports are essential to the success of the testing 
program. But by no means do all these records and reports come
chronologically at the end of the program. As a matter of fact, some of these are
essential to the last three stages already discussed.

In general, it may be said that four groups have an interest in knowing 
what the tests show: the pupils, the teachers, the administrative officers, 
and the parents or public. The nature of the report will naturally vary 
somewhat with the group to whom it is made, and the nature of the record 
with the specific function it is to serve. However, regardless of the type
of record or its specific function in any particular situation, its general
function is always, as has been well stated by Stenquist, “to present test
results and related information in such a meaningful way as to arouse
interest and action, on the part of teachers, principals, supervisors, directors
of special divisions, and superintendents.”²⁴

Report to pupils. The pupils have a right to know their performance 
on all achievement tests whether standardized or nonstandardized. In many 
cases it is well to go over the papers with the pupils in order to point out 
the nature of the errors made. The success of any remedial program will
depend upon the pupils' cooperation. Long ago Thorndike stated the matter
succinctly in these words:²⁵


The final justification for every testing regime rests in Mary Jones and John 
Smith, and it therefore behooves all persons who are making and giving tests to 
take them into partnership as soon and as completely as is feasible. 


It is usually considered dangerous to present the results of intelligence 
tests to pupils. And there doubtless is more possibility of harm than of 
good in making known the mental ages and intelligence quotients of indi- 
vidual pupils. Difficulty is most likely to result from scores at the extremes 
of the distribution. Both pupils and parents can reconcile themselves to low 
scores on achievement tests, for that can be explained to their satisfaction 
on the ground that it is the school’s fault. But low intelligence test scores 
seem to reflect directly upon the good name of the family, and this is
resented. Only the exceptional pupil or parent has a fine enough philosophy
of life to reconcile himself to the realities implied by a low score, and to
resolve to make the most of it. There is also danger that the pupils with
high test scores will be so inordinately puffed up as to endanger both their 
social standing with their fellows and their academic standing with their 
teachers. There are, however, special cases in which information regarding 
intelligence scores may properly be given. Some examples of these cases
will be discussed in later chapters.


24 John L. Stenquist, “The Administration of a Program of Diagnosis and Remedial
Instruction,” Thirty-Fourth Yearbook of the National Society for the Study of Education,
page 518. Quoted by permission of the Society. Bloomington, Illinois: Public School
Publishing Company, 1935.

25 Edward L. Thorndike, “Tests and Their Uses,” Teachers College Record, 26: 98-04,
October, 1924.




Records and reports to teachers. The classroom teachers need to 
have several kinds of records and information, most of which they can 
prepare themselves. Each teacher needs, first of all, a complete test record 
sheet for all his pupils. This sheet gives the test record of all members of 
the class arranged in descending order on the basis of the total score or 
on the basis of the previous teacher's rating. Many publishers of standard 
tests include such a record form with each package of twenty-five tests. 
Some writers recommend an alphabetical order, but the rank order may 
be more useful. Stenquist²⁶ has described how a testing program in a large
city involving a quarter of a million tests annually was made to function
smoothly. “We have found,” says Stenquist, “that the effective use of tests
is a collective enterprise involving a high degree of cooperation on the part
of pupils, teachers, principals, supervisors, directors and superintendents.”
He also describes the “device that more than all else ‘sold’ our testing
program to teachers, principals and all others concerned.”²⁷ This is simply
an analysis chart for each class showing the names, ages, and distribution 
of scores on each test by entering the identification number of each pupil 
opposite his score. Duplicate copies are quickly prepared on blueprint 
machines and sent to the teacher, the principal, the supervisor, and the 
superintendent. 
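
The class record sheet arranged in descending order of total score, as described above, might be prepared along the following lines. The function name and the pupil data are hypothetical; tied scores simply receive consecutive ranks in this sketch, which a teacher may prefer to handle differently.

```python
def class_record_sheet(records):
    """Arrange (name, total_score) pairs in descending order of total score.

    Returns (rank, name, score) rows. Ties keep their original relative
    order and receive consecutive ranks (an assumption of this sketch).
    """
    ordered = sorted(records, key=lambda rec: rec[1], reverse=True)
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ordered, start=1)]


# Hypothetical class of three pupils.
sheet = class_record_sheet([("Adams", 50), ("Baker", 72), ("Clark", 61)])
for rank, name, score in sheet:
    print(rank, name, score)
```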

Most tests which attempt to measure various aspects of a subject, and 
all test batteries, provide on each test a form for a graphical record of each 
pupil’s performance. Figure 8 shows a sample record that enables the 
teacher to see at a glance the pupil’s general level and to get some indica- 
tion of his strong and weak points as well. When the results of the later 
tests have been given and entered on the same sheet in a different color, 
a clear picture of the pupil’s progress is available. Such a record has obvious 
advantages. 

Diederich²⁸ suggests a summary report by the teacher, the nature and
function of which is described as follows: 


After every important test or examination, whether standardized or home-made, 
the teacher would do well to prepare a brief report covering the nature of the 
group which took the test, the nature of the test and how the group was prepared 
for it, the highest, lowest, and middle scores, and the national norms if they are 
available. This statement might be mimeographed and one copy put in the folder 
of each pupil who took the test. On these copies should be typed or written the 
pupil’s score or standing in the test, what this meant, if anything, with relation
to the objectives of the course, and some comment as to strengths and weaknesses 
shown, progress or decline, and possible reasons. Such statements should not take 
long to prepare, and they would be immediately valuable in counseling. Perhaps 
no other occasion in the normal processes of school life offers such rich opportunities 


26 John L. Stenquist, “Making Tests Effective,” The Nation's Schools, 20: 18-21,
September, 1937.

27 John L. Stenquist, “Devices for Testing,” The Nation's Schools, 20: 30-33,
November, 1937.

28 Paul B. Diederich, “Evaluation Records,” Educational Method, 15: 439, May, 1936.


[Figure 8. An Educational Profile for a Standardized Achievement Test.
(Published by World Book Company.) The figure reproduces the Individual
Profile Chart of the Metropolitan Achievement Tests: Intermediate Battery,
Complete, filled in for Mary Smith (teacher, Miss Jones; Lincoln School,
Lexington; May 15, 1940). A note on the chart states that values beyond the
upper and lower grade limits of the norms are extrapolated. According to the
accompanying explanation, the chart furnishes a graphic picture of the
achievement of an individual pupil; the equivalents needed to complete the
profile are obtained from the norms at the end of each test, or from a table
of norms based on the local school medians; and the Supervisor's Manual gives
further details concerning the interpretation of the Profile Chart and the
uses of the test results.]


for helpful counseling. If tests and examinations are worth giving, they are worth 
recording and interpreting in a form which will enable those responsible for the 
pupil's education to act intelligently upon them, and to draw sound conclusions 
from them. 
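
The figures Diederich asks for in the teacher's summary report (the highest, lowest, and middle scores, with the national norm where available) can be gathered with a few lines of code. This is a sketch under stated assumptions: the function name and the scores are hypothetical, and the median is taken as the "middle" score.

```python
from statistics import median


def summarize_scores(scores, national_norm=None):
    """Collect the summary figures for a brief test report.

    The 'middle' score is assumed to mean the median of the group.
    """
    summary = {
        "number_tested": len(scores),
        "highest": max(scores),
        "lowest": min(scores),
        "middle": median(scores),
    }
    if national_norm is not None:
        summary["national_norm"] = national_norm
    return summary


# Hypothetical scores for a class of seven pupils.
report = summarize_scores([42, 55, 61, 38, 70, 49, 58], national_norm=52)
print(report)
```

The interpretive parts of the report (what the score meant in relation to the objectives of the course, and the comment on strengths and weaknesses) remain the teacher's own judgment and cannot be computed.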


Records and reports for administrators. In a small school the prin- 
cipal will find useful much of the same kind of information for all pupils 
in the school that the various teachers find useful for their own pupils. 
In the larger schools and school systems the administrative officers will be 
mainly interested, not so much in individuals as in the summaries of classes, 
grades, and schools. 

Probably the most important records in the principal's office are the 
individual records of each pupil in the school. To be most useful these 
records should be comprehensive, cumulative, and convenient. They should 
be comprehensive, including not only the test record but other pertinent 
information regarding the pupil, such as school marks, health record, per- 
sonality ratings, vocational aptitude and interest data, avocational expe- 
rience and interests, social background, age and grade progress, and the 
like. A notable survey of 870 schools by Leonard and Tucker²⁹ reveals that
83 per cent kept complete records of all pupils permanently. The study also 
shows that the typical school used six tests, of which intelligence tests were 
most common. These records should be cumulative; that is, they should 
reveal the pupil's record over a period of years, preferably from nursery 
school or kindergarten through high school or community junior college. 
There is a decided advantage in having the complete developmental history 
of the individual pupil. One early study,³⁰ for example, called attention to
the value of the information contained on the ordinary school record card,
which gives a “picture of the pupil under varying conditions and stages of
development” quite analogous to the biologist's record of the life history
of an organism.

Such records must also be convenient to use. For each pupil there should 
be a card or folder, upon which all data are expressed in comparable units. 
Figure 9 shows the test data summary from the cumulative guidance record 
of the Department of Supervision and Curriculum Revision of the N. E. A.³¹
There are both advantages and disadvantages in a graphical record, of 
which type the one published by the Educational Records Bureau is widely 
used. The test record section is reproduced in Figure 10. Such a graphical 
record makes it possible to estimate fairly easily the quality, amount, and 


29 E. A. Leonard and A. C. Tucker, The Individual Inventory in Guidance Programs in
Secondary Schools, 60 pages. Washington: U. S. Office of Education, 1941.

30 Clay Campbell Ross, The Relation between Grade School Record and High School
Achievement, 70 pages. New York: Bureau of Publications, Teachers College, Columbia
University, 1925.

31 Educational Leadership, 1: 305-310, April, 1945.


[Figure 9. Test Data Summary from the Cumulative Guidance Record of the
Department of Supervision and Curriculum Development of the National
Education Association. The form includes sections for information concerning
the home, educational test data (elementary, junior high school, and senior
high school), and significant physical and mental health information.]

[Figure 10. A Cumulative Record in Graphical Form. (Cumulative Record for
Independent Schools, Educational Records Bureau, 437 West 59th Street,
New York, N. Y.)]


consistency of progress made. A distinguishing feature of this record is the 
use of percentile ranks, so spaced that equal distances vertically represent 
more nearly equal changes in achievement than do the percentile ranks 
themselves. 
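
One common way to space percentile ranks so that equal vertical distances represent more nearly equal changes in achievement is to plot each rank at its normal-curve deviate. The sketch below assumes a normal distribution of achievement; the source does not say what spacing the Educational Records Bureau form actually uses, so this is an illustration of the idea rather than a description of that form. The function name is hypothetical.

```python
from statistics import NormalDist


def percentile_to_deviate(percentile_rank):
    """Convert a percentile rank (between 0 and 100, exclusive) to the
    corresponding deviate of the unit normal curve."""
    return NormalDist().inv_cdf(percentile_rank / 100)


# Vertical position of several percentile ranks on a normal-deviate scale.
for pr in (1, 10, 25, 50, 75, 90, 99):
    print(pr, round(percentile_to_deviate(pr), 2))

# Ten percentile points near the middle cover much less of the scale
# than ten percentile points near the top.
middle_step = percentile_to_deviate(60) - percentile_to_deviate(50)
upper_step = percentile_to_deviate(99) - percentile_to_deviate(89)
print(middle_step < upper_step)
```

On such a scale the ranks are crowded together near the 50th percentile and spread apart toward the 1st and 99th, which is the spacing the record form is said to approximate.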

Such records, although easy to interpret, are somewhat laborious to prepare.
Hangen³² found that the graphical record required twice as much
time to prepare as the numerical record and was somewhat more subject
to error. This is perhaps largely responsible for the discovery quite a few
years ago that only about one third of the schools which held membership
in the Educational Records Bureau made a graph of all test results.³³ Care
must be exercised that such records do not become ends in themselves, and 
that not so much time is devoted to their preparation that none remains 
for their practical use. Schools with limited resources and little clerical 
assistance should be content with less elaborate record systems than those 
which may be feasible for larger and wealthier schools. 

Warnings concerning profiles. Because of their deceptive simplicity, 
profiles invite improper interpretations. Three precautions must be kept 
in mind: 


1. Every point on an individual's profile should be based upon the same norm 
group as every other point, or at least upon highly similar groups. It is dangerous 
and misleading, for example, to have the percentile point representing the pupil's 
score on a clerical aptitude test determined by comparison with the scores of 
"employed males," while the percentile point for his mechanical aptitude test score 
is based upon “engineering college freshmen.” Similarly, the student taking both 
Latin and the required course in English is usually competing with rather select 
students in the first class. If he has a percentile rank of 50 (exactly average) in 
Latin and 75 in English, this does not necessarily mean that his knowledge of 
English is superior to that of Latin. The typical student in the Latin class may 
have a percentile rank of 75 in English when compared with all students in the 
English class. This problem does not arise with a battery of tests that have all 
been standardized upon the same group. It is minimized when some procedure for 
obtaining equivalent scores is available. 

2. Since differences between scores are less reliable than the separate scores 
themselves, only large discrepancies in the individual's profile should be interpreted
as indicating better achievement in one area than in the other. Many teachers try 
to use for diagnostic purposes slight or moderate differences that may be due 
entirely to chance. With a group profile based upon average scores in a school grade 
of even as many as 30 students, however, the differences do not need to be as large 
in order to be both statistically and educationally significant, for averages are much 
more reliable than the scores from which they are obtained. 

3. Percentile ranks do not form an equal-unit scale, so they should be spread 
out at both ends and condensed in the middle when being used as the basis for 
points in a profile. See Figure 37 on page 297. 
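
The second precaution rests on the fact that chance errors tend to cancel in an average. A minimal sketch, assuming the chance errors in the separate scores are independent and of equal size: the chance error in the average of n scores is then the chance error of a single score divided by the square root of n. The function name and the numerical values are illustrative only.

```python
from math import sqrt


def standard_error_of_mean(sd_of_single_score, n):
    """Standard error of an average of n scores, assuming independent
    chance errors of equal standard deviation."""
    return sd_of_single_score / sqrt(n)


# Illustrative values: chance variation of 10 points in a single score.
single = standard_error_of_mean(10, 1)     # one pupil's score
grade = standard_error_of_mean(10, 30)     # average of a grade of 30 pupils
print(single, round(grade, 2))
```

With 30 pupils the chance error of the average is less than one fifth that of a single score, which is why a difference too small to interpret in an individual profile may still be meaningful in a group profile.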


32 A master's thesis summarized in Charles C. Peters (Editor), Abstracts of Studies in
Education at The Pennsylvania State College, Part IV, 1936, pages 27-28.

33 Arthur E. Traxler, “The Use of Test Results in Secondary Schools,” Educational
Records Bulletin, 25: 8-9, 1938.


[Figure 11. Centile Sheet for College Men and Women in Allport, Lindzey, and
Vernon, A Study of Values, Based upon the Norms (851 Men, 965 Women) in the
Manual of Directions (Boston: Houghton Mifflin Company, 1951). The Six Scores of
John Doe are Plotted. The sheet lists, for each centile, the corresponding
score ranges on the six values (Theoretical, Economic, Aesthetic, Social,
Political, Religious), with separate entries by sex.]


Figure 11 illustrates the above three precautions and also shows how sex
differences can be taken into account. “Study of Values” scores are shown
for a certain college sophomore who secured 67 points on the theoretical
scale, which is above the 99th (per)centile, and only 20 points for the
economic value. Clearly, this man's dominant value among the six is
theoretical. Whether he actually has less of the economic attitude than of
the political cannot be decided from this profile, since these two centiles
(1 and 8) are negligibly different. Likewise, it is hazardous to say that he
is less religious than aesthetic, or less aesthetic than social, for these three
values all occur rather close together near the middle of the profile.
Therefore, even with such an extreme profile we can safely say only that he is
highest on theoretical and social, lowest on economic and political, and
not particularly high or low on religious and aesthetic.

34 An explanation of the basis for this table is contained in Julian C. Stanley, “Study
of Values Profiles Adjusted for Sex and Variability Differences,” Journal of Applied
Psychology, 37: 472-473, December, 1953.

Reports to parents or public. Only a few schools make a systematic 
effort to keep the public informed regarding the educational progress of its 
schools. Results of the testing program might very well be summarized 
before the Parent-Teacher Association, women’s clubs, luncheon clubs, and 
similar organizations. Slides and charts, illustrating the nature of the tests, 
with analysis and interpretation of the records of typical pupils, would be 
instructive. The cumulative record cards are naturally of great value in 
conferences with parents regarding the educational program of their chil- 
dren. Hilkert³⁵ points out clearly how this may be done. Unless parents,
as well as teachers, administrators, and students, participate in the inter- 
pretation of test results and the planning of action based at least partly 
upon them, the testing program will not be optimally effective. A further 
discussion of the use of measurement in programs of public relations is 
given in Chapter 16. 



35 Robert N. Hilkert, “Parents and Cumulative Records,” Educational Record, 21:
172-183, Supplement No. 13, January, 1940.


246 THE TESTING PROGRAM 


by Arthur E. Traxler; “Reproducing the Test,” by Geraldine Spaulding; and 
"Performance Tests of Educational Achievement," by David G. Ryans and 
Norman Frederiksen. 4 

National Committee on Cumulative Records, Handbook of Cumulative Records. 
Washington, D. C.: U. S. Office of Education, 1944. 104 pages. 

Stephenson, William, Testing School Children; An Essay in Educational and Social 
Psychology. New York: Longmans, Green and Company, 1949. Chapter IX, 
“Principles and Practice of Selection.” 

Super, Donald E., Appraising Vocational Fitness by Means of Psychological Tests. 
New York: Harper, 1949. Chapters IV and V, “The Nature of Aptitudes and 
Aptitude Tests" and “Test Administration and Scoring." 

Traxler, Arthur E.; Jacobs, Robert; Selover, Margaret; and Townsend, Agatha, 
Introduction to Testing and the Use of Test Results in Public Schools. New York: 
Harper & Brothers, 1953. 113 pages. 

Worcester, D. A., *A Misuse of Group Tests of Intelligence in the School," Edu- 
cational and Psychological Measurement, 7: 779—781, Winter, 1947. 

World Book Company, Yonkers-on-Hudson, New York. The following free publica- 
tions, most of them reprints of articles from professional journals, may be of 
interest in connection with this chapter: 

Durost, Walter N., *What Constitutes a Minimal School Testing Program," 
Test Service Notebook No. 1. 

Durost, Walter N., “Tests and the Junior High School Guidance Counselor,” 
Test Service Notebook No. 2. 

Lewin, Lillie, “Pupil Adjustment Through Measurement,” Test Service Bulle- 
tin No. 40. Р 

Burnside, Carolyn J., “Improving the Reading of Seventh Graders,” Test 
Service Bulletin No. 44. 

Super, Donald E., “The Place of Aptitude Testing in the Public Schools,” 
Test Service Bulletin No. 49. 

Starkey, Mary L., "Determining Individual Needs and Capacities Through 
Testing,” Test Service Bulletin No. 56. 

Stenquist, John L., “Growth,” Test Service Bulletin No. 59. 

Bridges, Claude F., “Some Basic Considerations in Determining the Signifi- 
cance of Achievement Test Results,” Test Service Bulletin No. 66. 

Brown, Woodrow A., “Testing in Pennsylvania’s Public Kindergartens,” Test 
Service Bulletin No. 67. 




The Graphical Representation of 
Educational Data 


A. The Value of Graphs 


“One picture is worth ten thousand words." So runs an old Chinese 
proverb. “There is a magic in graphs,” says a modern writer. He describes 
the dynamic role of the graphical representation of numerical data as 
follows: 

Words have wings, but graphs interpret. Graphs are pure quantity stripped of 
verbal sham, reduced to dimension, vivid, unescapable. . . . Wherever there are 


data to record, inferences to draw, or facts to tell, graphs furnish the unrivaled 
means whose power we are just beginning to realize and to apply. 


There can be little doubt that the graphical representation of educa- 
tional data is a valuable supplement to statistical analysis and summariza- 
tion. The psychological value of graphs in the testing program may be 
considered under three headings: They attract attention, they clarify the 
meaning, and they aid retention. 

Graphs attract attention. In the first place, the graph or chart tends
to attract the reader’s attention. Advertisers employ a wide variety of
pictures, charts, and diagrams, for they realize that the first step in making
a sale is to attract the prospective customer’s attention. They have learned
that pictures will do this where numerical data and printed material will
not. The average reader is likely to give scant attention to the ordinary
printed matter in a school report and be wholly unimpressed by the appalling
mass of tabular data often piled up at the end, but his eye is likely
to be arrested by any picture or chart that may happen to be included.
And this may lead him to read the entire discussion. There is evidence that
school administrators are beginning to learn this lesson.²

1 Henry D. Hubbard, quoted by W. C. Brinton, Graphic Presentation, page 2. New
York: Brinton Associates, 1939.

[Figure 12 chart: enrollment in millions, 1900-1953 (1953 estimated), with the
per cent distribution for 1900 (17.2 = 100%) and for 1953 (34.4 = 100%).]

Figure 12. Back to School, United States, 1900-1953. (Used by Permission of the
National Industrial Conference Board.)

Graphs clarify points. In the second place, the graph is often an effec- 
tive method of clarifying a point. One small chart will often make a point 
clearer than a dozen tables or paragraphs. It is sometimes said that the 
facts speak for themselves. In reality, statistics often stand speechless and 
silent, tables are tongue-tied, and only the chart cries aloud its message 
to all the world. Ordinary numerical data are quite abstract; they convey 
their meaning vaguely and with effort to the average mind. The picture 
or graph is a more concrete representation of the matter. 

Educational facts, such as comparative enrollment figures over a 53-year 
period, may be presented effectively by graphical means, as shown in 
Figure 12. There both bar and pie charts are used to contrast the 17.2 million
persons enrolled in five types of schools in 1900 with the 34.4 million
enrolled in 1953. The bar charts show actual numbers, while the two pie 
charts have percentage slices. 

Figure 13 represents strikingly the enormous inequalities among states 
in the support of public education.³

A wide variety of charts are shown in Figure 14. The basic information 
concerning “Motor Buses in Operation in the United States” is given first 
in tabular form, followed by 15 different black-and-white charts.⁴

Graphs aid retention. It has been found that the graphical presentation
of certain types of data is a definite aid to recall. Washburne⁵ compared
the efficiency of graphical, tabular, and textual modes of presenting
historical data to pupils in the junior high school. The material, which 
dealt with certain specific quantitative facts, was kept constant, but the 
mode of presentation varied. Sometimes it appeared as a statistical table, 
sometimes as a bar graph, a pictograph, or a line graph, and at other times 
it was presented in ordinary paragraph form. Among the conclusions arrived
at by the author were the following:⁶


1. The paragraph is, in general, the form which is least favorable to recall of
quantitative data, whether general or specific.

2. The bar graph is the form most favorable to the recall of relative amounts
(static comparisons) when the comparisons called for involve a fair degree of
difficulty. For very simple data some form of pictograph may be more favorable to
the recall of relative amounts than the bar graph.

3. The line graph is the form most favorable to the recall of relative increase,
decrease, and fluctuation (dynamic comparisons).

4. The statistical table is the form most favorable to the recall of specific
amounts.


2 Cf. Douglas E. Scates, "Reporting, Summarizing and Supplementing Educational
Research," Review of Educational Research, 12: 558-574, December, 1942.

3 John K. Norton and Eugene S. Lawler, Unfinished Business in American Education,
page 13. Washington: American Council on Education, 1946.

4 This material appears on the first two inside pages of Mary Eleanor Spear's Charting
Statistics. New York: McGraw-Hill Book Company, Inc., 1952. 253 pages.

5 John Noble Washburne, “An Experimental Study of Various Graphic, Tabular and
Textual Methods of Presenting Quantitative Material," Journal of Educational
Psychology, 18: 361-476, September and October, 1927.

6 Ibid., page 475.


[Figure 13 chart: a bar graph, printed sideways in the source, that appears to
show current expenditure for the median classroom unit in each of the 48 states,
with the national median marked. The caption and state labels are not legible in
this copy.]


[Figure 14 charts: "Which Chart to Use?" A data table, Motor Buses in Operation in
United States (source: Bus Transportation), is followed by fifteen panels: The
Curve Chart, The Grouped Column, The Bar Chart, The Cumulative Curve, The
Subdivided Surface, The Sliding Bar, The Pictogram, The Index Chart, The
Logarithmic Chart, The Subdivided Column, The Paired Bar, The Bar and Symbol, The
Column and Curve, The Pictorial Surface, and The Pie Chart. The table values are
not legible in this copy.]

Figure 14. Motor Buses in Operation in the United States: Fifteen Different Charts
Based Upon the Same Data. By permission from Charting Statistics, by Mary Eleanor
Spear. Copyright, 1952. McGraw-Hill Book Company, Inc.


One study⁷ on graph interpretation in the elementary schools points out
that little is known regarding the comparative value of various graphs, 
although the circle graph appears to be easiest and the line graph most 
difficult, with the bar graph occupying an intermediate position. Her results 
indicated that a mental age of 14 years was required for the satisfactory 
interpretation of bar and line graphs without specific instruction in reading 
materials presented in graphical form. 

These findings appear to be in line with a principle of learning abun- 
dantly supported by experimental evidence: namely, that the method of 
presentation which makes the meaning clearest is most favorable to learn- 
ing and recall. It is important to recognize that neither statistical nor 
graphical methods bestow precision upon data. They are merely useful 
ways of expressing whatever accuracy exists. 


B. Representing the Record of an Individual 


There is no more striking way of representing the test record of an individual
pupil than by means of a graph. Such a graphical picture of the
strong and weak points of a single person is called a profile. Sometimes the 
term psychograph is used. Many publishers of standard tests provide blank 
forms for showing these profiles. Usually they appear on the first page 
of the test, where they can easily be detached for filing. 

Profiles of a single subject. Figure 15 shows the profile of a sixth-grade 
pupil on the Iowa Silent Reading Tests, New Edition. The broken-line 
profile for the class, based on medians, is also shown. With a median stand- 
ard score of 150, corresponding to a percentile rank of 49 for the eighth 
month of the sixth grade, John is an average reader. The median score 
of his class is 151, which has a percentile rank of 52. John scored highest 
on Tests 4 (Paragraph Comprehension) and 6A (Alphabetizing), poorest on 
Test 6B (Use of Index). He exceeded the class average considerably on 
Tests 1R (Rate) and 6A (Alphabetizing). 

Profiles for a series of subjects. Profiles are especially useful in repre- 
senting a pupil's record on two or more subjects. Most test batteries provide 
a convenient form for such a profile. Figure 16 shows the profile for a 
tenth-grade pupil on the California Achievement Tests, Advanced Battery. 

This student scored highest on syntax (Grade Placement = 15) and lowest
on spelling (GP = 6.7). His best area is Test 5, Mechanics of English
and Grammar (GP = 12.5). On the complete battery he has a dead-center-average
percentile rank of 50 and a grade placement of 10.6.

After a period of remedial instruction, the purpose of which is to strengthen
the weak points of individual pupils, it is a good practice to give a second
form of the same test. A second profile drawn in a different color upon the
same sheet is one of the best ways of revealing the progress made, if changes
are interpreted cautiously.


7 Sister Clara Francis Bamberger, “Interpretation of Graphs at the Elementary School
Level," Educational Research Monographs, 13: 1-62, May 1, 1942.


[Figure 15 form: title page of the Iowa Silent Reading Tests, New Edition,
Elementary Test (Revised), by H. A. Greene and V. H. Kelley, published 1943 by
World Book Company, with the Profile Chart filled in for John Doe.]

Figure 15. Profile of a Pupil and the Sixth-Grade Class of Which He is a Member.
(Reproduced by Permission of World Book Company.)


[Figure 16 chart: sample diagnostic profile, complete battery, for a test given
in January to a tenth-grade student (age, 183 months; mental age, 192 months),
with grade placement plotted for each subtest and total in reading, mathematics,
and language.]

Figure 16. The Profile of a Tenth-Grade Pupil on the California Achievement Test.
(Reproduced by permission of the California Test Bureau.)


[Figure 17 chart: percentile rank plotted for each of the attainment subtests.]

Figure 17. Profiles for a Student Tested in the Fifth and Sixth Grades. (Reproduced
by permission of Educational Test Bureau.)


Figure 17 is a “Normal Progress Chart” issued by the Educational Test 
Bureau for use with their Coordinated Scales of Attainment. It shows two 
profiles for Mary L. , one (solid line) based upon her perform- 
ance on Battery 5, Form A, on January 18 in the fifth grade and the other 
(broken line) for Battery 6, Form B, administered at the same point in the 
sixth grade. The grade equivalents of her scores varied from 5.2 for science 
in the fifth grade to 7.9 for English and literature in the sixth grade. Mary’s 




overall percentile ranks in the fifth and sixth grades are 55 and 60, respec- 
tively, while the percentile rank equivalent of her Kuhlmann-Finch IQ 
of 111 is 75. Thus, even though she is slightly above the average of the 
class in achievement, Mary seems to be working somewhat below her 
potential ability to attain. 


C. Representing a Frequency Distribution 


The ordinary frequency distribution does not give a very clear picture 
of the situation. There are three common methods of representing a
distribution of scores graphically: the histogram or column diagram, the
frequency polygon, and the smooth curve.

The histogram or column diagram. The histogram is a series of columns,
each of which has as its base one class interval and as its height the
number of cases, or frequency, in that class. Figure 18 represents a histogram
showing the distribution of percentage values assigned to an arithmetic
paper by forty-two scorers. As the greatest frequency is 9, in the
59.5-64.5 class, it is not necessary to extend the vertical or frequency scale
at the left above 9. As the scores range from the 29.5-34.5 class to the
74.5-79.5 class, it is necessary to represent the horizontal scale only through
that distance. It is customary, however, to extend the scale one class
interval above and below that range. In order to avoid having the figure too
flat or too steep, it is usually well to arrange the scales so that the width
of the figure is about one and two thirds times its height; that is, the ratio
of height to width should be approximately 3:5. In actual practice it is
customary to represent the histogram in outline form, rather than to show
the full length of the columns. Figure 19 illustrates the shaded outline form
of the histogram.


[Figure 18 chart: vertical axis, Number of Scorers; horizontal axis, Percentages
Assigned, marked at the class mid-points from 27 to 82.]

Figure 18. A Histogram, or Column Diagram, Representing the Percentage Values
Assigned to an Arithmetic Paper by Forty-Two Scorers.
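
The layout rules just described can be tried out in a few lines of code. What
follows is only a minimal sketch in Python, not a procedure given in this book;
the function names, the sample scores, and the default class limits are
illustrative assumptions.

```python
def histogram_layout(scores, lowest_limit=29.5, width=5.0):
    """Tally scores into class intervals and suggest axis ranges.

    Assumes every score is at or above lowest_limit (an illustrative default).
    """
    n_classes = int((max(scores) - lowest_limit) // width) + 1
    counts = [0] * n_classes
    for s in scores:
        counts[int((s - lowest_limit) // width)] += 1
    # Extend the horizontal scale one class interval below and above the data.
    x_range = (lowest_limit - width, lowest_limit + (n_classes + 1) * width)
    # The vertical scale need not go above the greatest frequency.
    y_max = max(counts)
    return counts, x_range, y_max


def figure_size(width_units=5.0):
    """Return (width, height) with height to width approximately 3:5."""
    return width_units, width_units * 3 / 5


# Hypothetical scores, not the forty-two values behind Figure 18.
counts, x_range, y_max = histogram_layout([32, 37, 38, 61, 62, 77])
print(counts, x_range, y_max, figure_size())
```

The first list gives the height of each column, lowest class first; the two
ranges give the limits of the horizontal and vertical scales.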


[Figure 19 chart: vertical axis, Number of Pupils; horizontal axis, Intelligence
Quotient.]

Figure 19. A Histogram, or Column Diagram, Representing the Distribution of the
83 IQ's in a Small Junior High School.


The frequency polygon. The process of constructing the frequency
polygon is very much like that of constructing the histogram. In the
histogram, the top of each column is indicated by a horizontal line the length
of one class interval, placed at the proper height to represent the frequency
at that class. But in the polygon a point is located above the mid-point of
each class interval and at the proper height to represent the frequency
at that class. These points are then joined by straight lines. As the
frequency is zero at the classes above and below those in the distribution, the
polygon is completed by connecting the points that represent the highest
and lowest classes with the base line at the mid-points of the class intervals
next above and below. Figure 20 shows a polygon for the same data
represented by a histogram in Figure 18.


[Figure 20 chart: vertical axis, Number of Scorers; horizontal axis, Percentages
Assigned, marked at the class mid-points from 22 to 87.]

Figure 20. A Frequency Polygon Representing the Percentage Values Assigned to an
Arithmetic Paper by Forty-Two Scorers.
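
The construction of the polygon can be sketched in code as well. The following
Python fragment is an illustration only, under the assumption that the
frequencies are already tallied by class; the function name and the sample
frequencies are hypothetical.

```python
def polygon_points(counts, lowest_limit, width):
    """Return the (mid-point, frequency) corners of a frequency polygon.

    The polygon is closed to the base line at the mid-points of the empty
    class intervals next below and next above the distribution.
    """
    points = [(lowest_limit - width / 2, 0)]
    for k, f in enumerate(counts):
        points.append((lowest_limit + (k + 0.5) * width, f))
    points.append((lowest_limit + (len(counts) + 0.5) * width, 0))
    return points


# Three hypothetical classes beginning at 29.5, each five points wide.
corners = polygon_points([1, 2, 3], 29.5, 5.0)
print(corners)
```

Joining these corners in order with straight lines gives the polygon; the first
and last corners lie on the base line.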

The smooth curve. Sometimes a smooth curve is drawn instead of the 
histogram or frequency polygon. The only difference is that for the former 
a smooth curve is drawn through the points, and for the latter two figures
a jagged line is used. The most common use in educational measurement 
of a smooth curve is in the so-called normal curve. Figure 21 shows such a 
curve superimposed upon a histogram representing the actual distribution 
of ninth-grade pupils on eleven intelligence tests. 


[Figure 21 chart: a histogram with a dotted normal curve superimposed.]

Figure 21. An Actual Curve Compared with the Theoretical Curve of Probability.
Actual curve is a histogram based upon curves for eleven well-known group
intelligence tests administered to the ninth grade. The dotted line represents the
theoretical (normal) curve. (From E. L. Thorndike's The Measurement of
Intelligence, Bureau of Publications, Teachers College, Columbia University,
page 529.)


There is one smooth curve, however, which is widely used in representing
test scores. This is the percentile curve, or ogive. Figure 22 shows a percentile
curve used to represent the percentage data already employed to illustrate
the histogram and the polygon. The points that determine the percentile
curve are located on the horizontal line at the upper limit of each class,
at the position that indicates on the horizontal scale the percentage of
scores up to and including that class. It will be noted, also, that two
columns have been added to the ordinary frequency table. The cumulative
frequency column indicates the number of scores up to and including each
class. For example, there is one score in the 30-34 class, and there are two
in the 35-39 class, making a cumulative frequency of 3 in the two lowest
classes. The cumulative per cent column shows what percentage each of
these cumulative frequencies is of the total. In the illustration the total,
N, is 42. The first entry in this column is, of course, 100; the second is 98,
because 41 is 98 per cent of 42; the third is 95, because 40 is 95 per cent
of 42; and so on for the others. Each value in the cumulative per cent
column is represented as a point on the upper limit of that class interval (the
horizontal line separating that class from the class above it), since it
includes the percentage of scores up through that class. These points
determine the curve. As a rule, especially in small groups where irregularities
are most likely to occur, it is best to miss some of the points in order to
obtain a smooth and regular curve; but care should be exercised in order
to leave about as many points on one side of the line as on the other.
Figure 23 shows another ogive. Such a smoothed curve, although it does
not exactly represent the actual sampling, probably indicates very closely
what is to be expected “in the long run."


[Figure 22 chart: the frequency table, with cumulative frequency and cumulative
per cent columns added, and the percentile curve drawn from it.]

Figure 22. A Percentile Curve Representing the Percentage Values Assigned to an
Arithmetic Paper by Forty-Two Scorers.


[Figure 23 chart: horizontal axis, Percentile Scale, with Q1, Md, and Q3 marked.]

Figure 23. A Percentile Curve Representing the Distribution of 83 IQ's in a Small
Junior High School (see Figure 19). The Values of Q1, Median (Q2), and Q3 Read from
the Curve Are Shown with the Computed Values (in Parentheses).


[Figure 24 charts: two frequency curves, each with frequency on the vertical axis
and score, from low to high, on the horizontal axis.]

Figure 24. Negative and Positive Skewness.
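
The two added columns are easily computed. The sketch below, in Python, is
offered only as an illustration. The frequencies are assumptions: they agree
with the few values stated in the text (N of 42, one score in the lowest class,
two in the next, 9 in the 59.5-64.5 class, and cumulative per cents of 95, 98,
and 100 at the top), but the remaining entries are invented, and the function
name is hypothetical.

```python
def cumulative_rows(counts, lowest_limit, width):
    """Return (upper limit, cumulative frequency, cumulative per cent) rows.

    Rows run from the lowest class upward; each point of the percentile
    curve is plotted at the upper limit of its class.
    """
    total = sum(counts)
    running = 0
    rows = []
    for k, f in enumerate(counts):
        running += f
        upper = lowest_limit + (k + 1) * width
        rows.append((upper, running, round(100 * running / total)))
    return rows


# Illustrative frequencies for the classes 29.5-34.5 through 74.5-79.5.
counts = [1, 2, 3, 4, 6, 8, 9, 7, 1, 1]
rows = cumulative_rows(counts, 29.5, 5.0)
for row in rows:
    print(row)
```

Read from the bottom of the printed list upward, the last column reproduces
the entries 100, 98, and 95 mentioned above.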

Symmetrical and skewed curves. Regardless of whether a distribution
is represented as a histogram, a polygon, or a smooth curve, the curve
will be either symmetrical in shape, or else pushed or pulled to the right
or left. A symmetrical curve is balanced in the center and slopes regularly
in both directions. One that is pushed or pulled in one direction is said to be
skewed. If the peak of the curve is toward the upper end of the scale, with
the longest slope downward toward the lower end of the scale, the curve is
negatively skewed (skewed to the left). On the other hand, if the peak of
the curve is toward the lower end of the scale, with the longest slope
toward the higher end of the scale, the curve is positively skewed, or
skewed to the right. Both kinds of curves are shown in Figure 24. Many
curves met with in educational measurement show some skewness, although
the departure from symmetry is usually not very great in larger samplings
unless some selective factors are operating.


[Figure 25 chart: typewriter bar graph with one row per IQ class from 75-79 to
145-149, each row giving the frequency f followed by that number of X's. Most rows
are not legible in this copy.]

Figure 25. Bar Graph Made on the Typewriter, Showing the Distribution of 91 IQ's
in a Junior High School.

Typewriter graphs. A satisfactory bar graph can be made on the type- 
writer. Figures 25, 26, and 27 illustrate this type of graph. Other graphs, 
such as the circle, or pie graph, and various picture graphs, or pictographs, 
are occasionally met with in educational measurement; these are illustrated 
in Figures 12, 13, and 14. 
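
A typewriter graph of this kind amounts to one printed line per class, and the
same layout can be produced by a short program. The Python sketch below is an
illustration only; the function name is hypothetical, and apart from the
105-109 row (frequency 10), which is legible in Figure 25, the frequencies are
invented.

```python
def typewriter_graph(classes, mark="X"):
    """Return one line per class: label, frequency, and a bar of marks."""
    lines = []
    for label, f in classes:
        lines.append(f"{label:>8} {f:>3} {mark * f}")
    return "\n".join(lines)


# Highest class first, as in a typed frequency table.
graph = typewriter_graph([("110-114", 12), ("105-109", 10), ("100-104", 8)])
print(graph)
```

Because each mark occupies one typewriter space, the length of a bar is
directly proportional to the frequency it represents.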


[Figure 26 chart: typewriter bar graph with one row per age group in years; bars
run to the left for Percentage Not Graduating and to the right for Percentage
Graduating, each on a scale from 0 to 100.]

Figure 26. Bar Graph Made on the Typewriter, Showing the Percentage of Pupils
of Each Age Group Who Were Graduated from High School and the Percentage Who
Entered High School but Did Not Graduate.


Which graph is best? As is to be expected, no one type of graph is
equally good for all purposes. The histogram is the easiest of all to
understand and is usually best if but one distribution is being represented. If two
or more distributions are to be compared, however, polygons are usually
better, since so many lines coincide when histograms are superimposed
that the picture is likely to be confusing. The percentile curve has many
advantages not possessed by other curves. The first of these is that it is
possible to estimate with a high degree of accuracy the quartiles, medians,
and other similar points. This means that one can read directly from the
curve percentile measures like those illustrated in Figure 23. As will be
shown in the next section, by means of percentile curves several groups
can be presented, for convenient comparison, on a single sheet. The principal
value of bar graphs, circle graphs, and picture graphs lies probably
in school publicity and in the motivation of learning. “A successful graph,”
as Scates points out, “depends far more on careful thought and judgment
than on techniques."⁸
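
Reading a quartile or the median from an unsmoothed percentile curve is
equivalent to interpolating along the straight segment that crosses the
desired per cent. The Python sketch below illustrates this under stated
assumptions: the function name is hypothetical, the frequencies are invented
(they total 42, as in Figure 22, but are not the book's data), and scores are
assumed to be spread evenly within each class.

```python
def percentile_point(counts, lowest_limit, width, pct):
    """Return the score below which pct per cent of the cases fall."""
    target = pct / 100 * sum(counts)
    running = 0
    for k, f in enumerate(counts):
        if f > 0 and running + f >= target:
            lower = lowest_limit + k * width
            # Move into the class in proportion to the cases still needed.
            return lower + (target - running) / f * width
        running += f
    return lowest_limit + len(counts) * width


counts = [1, 2, 3, 4, 6, 8, 9, 7, 1, 1]  # illustrative frequencies, N = 42
q1 = percentile_point(counts, 29.5, 5.0, 25)
median = percentile_point(counts, 29.5, 5.0, 50)
q3 = percentile_point(counts, 29.5, 5.0, 75)
print(round(q1, 2), round(median, 2), round(q3, 2))
```

Values read from a smoothed curve will differ slightly from these computed
ones, as the parenthetical values in Figure 23 suggest.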


D. Representing Two or More Distributions 


There are many occasions when it is desirable to compare two or more 
distributions. For example, school administrators may wish to compare the 
intelligence or achievement of the pupils in various classrooms or buildings. 
The overlapping among the various grades within a single building is a 
striking way to present the need for individualized instruction and varied 
materials. 

Representing entire distributions. When it is important to compare
two or more entire distributions, as would be the case in a study of the
status of a school or school system, the choice will usually lie between the
frequency polygon and the percentile curve. The difficulty of superimposing
two or more histograms has already been pointed out. A series of polygons
may be drawn on the same sheet one above the other, or alongside each
other. Figure 27 illustrates a method of showing overlapping by bar graphs
made on the typewriter.


[Figure 27 chart: typewriter bar graph with one row per score class, from 20-39 to
200-219, and a separate bar of 7's, 8's, and 9's for the seventh, eighth, and ninth
grades. The individual rows are not legible in this copy.]

Figure 27. Graph Made on the Typewriter, Showing the Overlapping of Grades Seven,
Eight, and Nine in Reading Comprehension.

The use of polygons. The distinct advantage of polygons over histograms
for representing a series of distributions is that polygons can be
superimposed upon each other with less crossing of lines. In this form
comparisons among distributions are more easily made. Figure 28 illustrates
this possibility with the distribution of reading comprehension scores on
the Iowa Silent Reading Test for the seventh, eighth, and ninth grades of
a certain school. One fact stands out clearly, the great overlapping of the
three grades in reading ability. But even with only three distributions the
lines cross and recross so many times as to make any accurate comparison
of one grade with another somewhat difficult. More than three classes can
hardly be represented in the same graph by frequency polygons without
considerable confusion. It is also difficult to compare distributions
accurately where the numbers of cases vary greatly, unless each frequency is
represented as a per cent of its total.


8 Douglas E. Scates, op. cit., page 568.


[Figure 28 chart: three frequency polygons, with a legend for the Seventh, Eighth,
and Ninth grades; vertical axis, Number of Pupils; horizontal axis, score class
limits from 9.5 to 249.5.]

Figure 28. Frequency Polygons Representing the Distribution of Reading
Comprehension Scores on the Iowa Silent Reading Tests for the Seventh, Eighth, and
Ninth Grades of a Certain School. (Data from Figure 27.)
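
The conversion mentioned in the last sentence is a matter of simple
arithmetic. The following Python sketch, with a hypothetical function name and
invented frequencies, suggests how two groups of very different size can be put
on a common per cent scale before their polygons are drawn.

```python
def percent_frequencies(counts):
    """Express each class frequency as a per cent of the group total."""
    total = sum(counts)
    return [100 * f / total for f in counts]


# Two hypothetical groups with the same shape but different numbers of cases.
small_group = percent_frequencies([2, 6, 2])
large_group = percent_frequencies([10, 30, 10])
print(small_group, large_group)
```

Plotted as raw frequencies the second polygon would tower over the first;
plotted as per cents the two coincide.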

The use of percentile curves. For the graphic comparison of two or
more distributions the percentile curve has certain outstanding advantages.
Since the frequencies are reduced to per cents, it is readily possible to
compare groups of unequal size. Another important advantage is that several
distributions can be represented in a single graph without difficulty
or confusion. Figure 29 shows the distribution of reading comprehension
scores for the same grades as in Figure 28 in the form of a percentile curve.

From these percentile curves several relationships are observable that
were not apparent in the polygons. It is quite clear that although the
seventh and eighth grades have almost exactly the same average scores,
the eighth grade has greater variability. This is evident from the fact that
the upper half of the eighth grade exceeds the upper half of the seventh
grade, but that the lower half of the eighth grade falls behind the lower
half of the seventh.

Furthermore, although the ninth grade runs rather consistently above
the other two grades, about 15 per cent of the ninth-grade pupils fall below
the median of the seventh and eighth grades.


[Figure 29 chart: percentile curves for the three grades on a Percentile Scale,
with Q1, Md, and Q3 marked.]

Figure 29. Total Comprehension Scores on the Iowa Silent Reading Tests for the
Seventh, Eighth, and Ninth Grades.


[Figure 30 chart: line graph over Practice Periods, with a legend distinguishing
Full Knowledge, Partial Knowledge, and No Knowledge.]

Figure 30. The Learning of Three Groups Compared, One with Full Knowledge of
Progress, One with Partial Knowledge of Progress, and One with No Knowledge of
Progress.


Representing central tendencies of a series of distributions. It is 
frequently necessary to represent, not the entire distribution, but only the 
central tendencies or averages. A learning or progress curve is an illustra- 
tion. Figure 30 shows a graphic picture of the results of a learning experi- 
ment. It shows three groups, one with no knowledge of progress, one with 
partial knowledge of progress, and one with full knowledge of progress. 
It will be noted that after the second trial the progress was roughly propor- 
tional to the amount of knowledge possessed. A simple line graph makes 
this clear. 

[Figure 31 chart: line graph of median scores (vertical axis) by Grade, 4 to 9
(horizontal axis).]

Figure 31. Correct and Incorrect Location of the Norms in a Line Chart Showing
Median Scores on a Reading Test.


Another common use of the line graph is for comparing two or more 
schools through several grades, or one school with the norms on a test. 
Figure 31 shows the correct and the incorrect construction of such a graph. 
The solid line connects the median scores on a reading test for grades four 
to nine, inclusive. The tests were given in October, or one-tenth of the way 
through the grade. The dash line connects the norms incorrectly drawn 
from norms in the manual for the end of the grade. The dot-dash line con- 
nects the norms at the proper grade location. It will be noted that when the 
line is incorrectly located only the seventh grade appears to exceed the 
norm, whereas in reality every grade does. Here the horizontal axis is considered
a scale and the points determining the lines are located with reference to it.


[Figure 32: profile chart.]

Figure 32. Profiles for the Seventh, Eighth, and Ninth Grades of a Certain Junior
High School, Made by Connecting the Median Scores on Each Part of the Stanford
Achievement Test, Advanced Battery, Complete, Form J. (Reproduced by permission
of World Book Company.)




Figure 32 shows the profiles for the seventh, eighth, and ninth grades 
of a certain junior high school made by connecting the median scores on 
each part of the Stanford Achievement Test. This figure shows clearly that 
the school is weak in spelling, arithmetic computation, and study skills, 
and particularly strong in the social studies. It is evident that this school 
is stressing the content subjects at the expense of some of the more formal 
tool subjects. Whether or not this appears to be a desirable emphasis de- 
pends upon one's philosophy of education. 

Representing the central tendencies and variabilities of a series 
of distributions. The variabilities, as well as the central tendencies, of 
a series of distributions may be shown in a similar manner by line graphs. 
Figure 33 is an illustration. This figure shows Q₁, the median, and Q₃ for


Figure 33. A Line Graph Showing the Medians and Quartiles for Grades Four to Nine,
Inclusive, in Reading Comprehension. 


each grade from four to nine, inclusive, in reading comprehension. While 
the three lines have the same general shape, they converge slightly at the 
seventh grade, where the variability is least. It would be possible to include 
from the table of norms the corresponding medians and quartiles for the 
typical school, but to do so would make the figure too complicated for easy 
interpretation. 
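
The three points plotted for each grade can be obtained as in the following sketch. The scores are invented for illustration, and the `inclusive` interpolation rule of Python's `statistics.quantiles` is an assumption of the sketch; hand methods for grouped data may give slightly different quartiles.

```python
from statistics import quantiles

# Invented reading-comprehension scores for two grades, for illustration only.
scores_by_grade = {
    4: [10, 20, 30, 40, 50],
    5: [25, 30, 35, 40, 45],
}


def quartile_points(scores):
    """Return (Q1, median, Q3) for one grade's distribution of scores."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return q1, median, q3


for grade, scores in sorted(scores_by_grade.items()):
    q1, median, q3 = quartile_points(scores)
    # Q3 - Q1 is the distance between the outer lines of the graph at this grade.
    print(grade, q1, median, q3, q3 - q1)
```

A narrowing of the distance between Q₁ and Q₃ from one grade to the next corresponds to the convergence of the outer lines in a graph of this type.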




[Figure 34: bar graph; vertical axis, educational age, with grade norms marked.
Legend: vertical line is total range; vertical bar is middle 50%; bar markings
distinguish above grade norms, at grade norms, and below grade norms.]

Figure 34. The Central Tendency and Variability in Educational Age of Grades 2B
to 9A, Inclusive, in a Small City School System.


Figure 34 is a bar graph which shows the central tendency and variabil- 
ity in educational age of grades 2B to 9A, inclusive, in a small city school 
system.⁹ In each grade the vertical line indicates the total range, the vertical
bar indicates the range of the middle 50 per cent, and the middle of
the bar is the approximate position of the median. The horizontal lines 
across the full width of the graph indicate the norms for the beginning of 


9 Report of the Public Schools of Shelbyville, Kentucky, page 73. Bulletin of the Bureau
of School Service, Vol. I, No. 1. Lexington: University of Kentucky, 1928.




each grade. It will be noted that the part of each bar which is crosshatched
indicates the proportion that is above the grade norm, while the shaded
part is the proportion that is below the grade norm. The overlapping is 
especially marked from 7B to 9B. This condition suggests the advisability 
of trying to find out whether these ninth-grade classes happened to be 
weaker than usual, or whether the teaching emphasis was responsible for the 
apparent lack of improvement. This type of graph is an effective means 
for presenting the essential features of a total situation. Here the amount 
of overlapping is impressive. It will be noted, for example, that those whose 
EA is 12-6 are found in all grades from 5B to 9A, and that pupils classified 
in 8A vary in EA from just above the 4A level to almost the 10A level. 
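
The kind of overlap described above can be checked mechanically once the total range of each grade is known. The sketch below is illustrative only: the ranges are invented and are not read from Figure 34, and the helper names `to_months` and `grades_containing` are hypothetical.

```python
def to_months(age):
    """Convert an age written 'years-months' (for example '12-6') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


# Invented total ranges of educational age (lowest, highest) for four grades.
ea_range_by_grade = {
    "5B": ("9-0", "12-8"),
    "6B": ("9-6", "13-4"),
    "7B": ("10-2", "14-0"),
    "8B": ("10-8", "14-6"),
}


def grades_containing(ea, ranges):
    """List the grades whose total range includes the given educational age."""
    target = to_months(ea)
    return [
        grade
        for grade, (lowest, highest) in ranges.items()
        if to_months(lowest) <= target <= to_months(highest)
    ]


print(grades_containing("12-6", ea_range_by_grade))  # every grade in the table
print(grades_containing("9-2", ea_range_by_grade))   # only the lowest grade
```

The longer the list returned for a single educational age, the greater the overlapping among the grades.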


E. General Suggestions for Constructing Graphs 


Varied practice. A wide diversity of practice will be found in the
construction of graphs as used in psychology and education. The title is
sometimes placed above the graph, though usually it is placed below. In
nearly all books and periodicals the graph title is placed below, but in unpublished
charts such as wall charts the title is often more effective when
lettered above. The figures are numbered consecutively with Arabic numerals
placed at the beginning of the title. Sometimes the title is written
in capital letters, as in tables; sometimes the initial letters of all important 
words are capitals; and again, only the first word in the title is capitalized, 
unless there are proper names, in which case the usual rules for capitaliza- 
tion apply. The second of these methods is perhaps most common. 

Suggested standards. Years ago a committee composed of representatives
of the various groups interested in graphical methods prepared a
report recommending certain standards for constructing graphs.¹⁰ This report
still covers most of the points required for the proper representation
of educational data. The following rules are taken from it:


1. The general arrangement of a diagram should proceed from left to right. 

2. Where possible represent quantities by linear magnitudes, as areas or volumes
are more likely to be misinterpreted.

3. For a curve, the vertical scale, whenever practicable, should be so selected 
that the zero line will appear on the diagram. 

4. If the zero line of the vertical scale will not normally appear on the curve
diagram, the zero line should be shown by the use of a horizontal break in the
diagram.

5. The zero lines of the scales for a curve should be sharply distinguished from
the other coordinate lines.

6. For curves having a scale representing percentages, it is usually desirable to
emphasize in some distinctive way the 100 per cent line or other line used as a basis
of comparison.
10 W. C. Brinton, Chairman, “Preliminary Report, Joint Committee on Standards of
Graphic Representation,” Quarterly Publications of the American Statistical Association,
14: 790-797, 1915.




7. When the scale of a diagram refers to dates, and the period represented is
not a complete unit, it is better not to emphasize the first and last ordinates,
since such a diagram does not represent the beginning or end of time.

8. When curves are drawn on logarithmic coordinates, the limiting lines of the
diagram should each be at some power of ten on the logarithmic scales.

9. It is advisable not to show any more coordinate lines than necessary to guide
the eye in reading the diagram.

10. The curve lines of a diagram should be sharply distinguished from the ruling.

11. In curves representing a series of observations, it is advisable, whenever
possible, to indicate clearly on the diagram all the curves representing the separate
observations.

12. The horizontal scale for curves should usually read from left to right and
the vertical scale from bottom to top.

13. Figures for the scales of a diagram should be placed at the left and at the
bottom or along the respective axes.

14. It is often desirable to include in the diagram the numerical data or formulae
represented.

15. If numerical data are not included in the diagram, it is desirable to give
the data in tabular form accompanying the diagram.

16. All lettering and all figures on a diagram should be placed so as to be easily
read from the base as the bottom, or from the right-hand edge of the diagram as
the bottom.

17. The title of a diagram should be made as clear and complete as possible.
Sub-titles or descriptions should be added if necessary to insure clearness.


A useful manual which treats of the different phases of the construction
of line charts has been prepared by the Committee on Standards for
Graphic Presentation.¹¹ For a fuller discussion of the general problem of
graphical representation, several excellent books listed at the end of the
chapter are available.


The suggestions given by Spear should be kept constantly in mind when
constructing graphs:¹²

In the present day, when visual education in all aspects has become, not only
an aid to, but also a vital basis of learning, our attention is called more than ever
before to the almost limitless possibilities in this field. The eye absorbs written
statistics, but only slowly does the brain receive the message hidden behind written
words and numbers. The correct graph, however, reveals that message briefly and
simply. Its purposes, which follow, are clear from its context:


1. Better comprehension of data than is possible with textual matter alone. 
2. More penetrating analysis of subject than is possible in written text. ' 


11 Time Series Charts: A Manual of Design and Construction, 68 pages. New York:
American Society of Mechanical Engineers, 1938.

12 Quoted by permission from pages 3-4 of Mary Eleanor Spear, Charting Statistics.
New York: McGraw-Hill Book Company, Inc., 1952.


1. Determine the significant message in the data.

2. Be familiar with all types of and make the correct selection.

3. Meet the audience on its own level; know and use all appropriate visual aids. 
4. Give detailed and intelligible instructions to the drafting room. 

5. Know the equipment and skills of the drafting room. 

6. Recognize effective results. 


Even when no technical assistance is available, teachers and administra- 
tors can make excellent use of graphs to facilitate the attainment of educational
objectives.

SELECTED REFERENCES FOR FURTHER READING


Arkin, Herbert, and Colton, Raymond R., Graphs: How to Make and Use Them.
New York: Harper & Brothers, 1936. 224 pages.

Brinton, Willard Cope, Graphic Presentation. New York: Brinton Associates, 1938.
512 pages.

Kelley, Truman L., Fundamentals of Statistics. Cambridge, Mass.: Harvard University
Press, 1947. Chapter IV, “Graphic Methods.”

Modley, Rudolph, How to Use Pictorial Statistics. New York: Harper & Brothers,
1937. 170 pages.

“Presentation Problems,” a feature in the American Statistician since August, 1947.

Spear, Mary Eleanor, Charting Statistics. New York: McGraw-Hill Book Company,
1952. 253 pages.

Thompson, “Meaning in Space,” ETC: A Review of General Semantics,
8: 193-201, Spring.

Vernon, M. D., “The Use and Value of Graphical Methods of Presenting Quantitative
Data,” Occupational Psychology, London, 26: 22-94, January, 1952.




10 


The Uses and Limitations of Norms 


It is self-evident that the value of test scores will be dependent largely 
upon how well they are understood. The preceding chapter concerned the 
summarization of scores by graphical methods as an aid to their interpre- 
tation. The present chapter will consider some closely related problems of 
interpreting scores by the aid of norms. 


A. Norms and Standards 


Standardized versus nonstandardized tests. At the outset it is important
to distinguish clearly between a norm and a standard,¹ especially
because the terms are frequently used interchangeably. The confusion 
doubtless arises over the fact that norms are used with standard tests and 
that a part of the process of standardization is the derivation of norms. 

Many standard tests began as informal objective tests made by class- 
room teachers. When an informal test has gone through the process of 
standardization, it then differs from the original class test in four essential 
aspects. In the first place, the content has been standardized. This means 
that each item has survived most careful scrutiny by a competent person, 
or more likely a group, and that its difficulty and value have been deter- 
mined by rigid experimental processes that have eliminated its weaker 
fellows. In the second place, its method of administration has been stand- 
ardized. This means that definite directions have been worked out, usually 
with appropriate time limits, and the like. In the third place, the method 
of scoring has been standardized. This means that scoring keys have been 

1 John C. Flanagan emphasizes this distinction on page 698 and elsewhere in “Units,
Scores, and Norms,” Chapter 17 in E. F. Lindquist (Editor), Educational Measurement.
Washington, D. C.: American Council on Education, 1951.


prepared and that definite rules have been formulated for marking the 
papers and for determining the scores on each part and on the whole test. 
Finally, the process of interpretation has been standardized, at least in 
part. This means that tables of norms are now available for interpreting 
the various scores made on the test. These norms are merely scores which 
have been made by large numbers of pupils distributed over wide geograph- 
ical areas and representing various types of schools, and which have been 
grouped, as a rule, according to chronological age or school grade. 

Norms versus standards. The word standard implies a goal or objective 
to be reached. It should be clear, then, that a norm is not a measure of what 
ought to be, a goal, but is merely a measure of what is, the status quo. When
a grade or class is up to the national median on the test, it is just an average
or typical group. Of course, it may be that this score represents a reasonable 
performance for the group under the circumstances, but that fact would 
have to be determined by further inquiry. The mere fact that the grade 
attains the norm does not of itself establish anything other than that the 
performance is that of a typical group. Manifestly a group of students 
having superior opportunities and capacities ought to make better than a 
typical record. On the contrary, a group of low ability and opportunity 
might find it virtually impossible to do that well. Unfortunately, at the 
present time not many tests have more than one set of norms for each 
grade or age group, all types of pupils and schools being lumped together. 
What is needed is a norm for at least each major type of school organization 
and type of pupil. Even then such norms could hardly be regarded as 
reasonable standards of attainment. For one thing, the norms of achieve- 
ment tests are never more than tentative. They must be continually chang- 
ing with increases in length of school term and with improvement in train- 
ing of teachers, in textbooks, in school equipment, and the like. It is also 
not unreasonable to assume, human nature being what it is, that average 
achievement with the facilities now available could be considerably better 
than exists at the present time. In a real sense the only valid norm for the 
individual pupil is his own past record, and the only valid standard is his 
maximum capacity for growth. 

Reasonable standards, or goals of attainment, are almost altogether 
lacking. It is conceivable that such standards might be worked out and 
expressed in numerical units on existing tests, or on others to be devised. 
But such a process is inherently difficult, whereas the process of building 
norms is time-consuming and laborious but perfectly simple and straight- 
forward. In fact, an adequate technique for establishing standards has yet 
to be worked out. Ideally, a standard would have to be provided for each 
individual. At any rate, no one standard could be established which would 


2 The achievement tests prepared by the Armed Forces Institute had separate norms
for six geographical regions as well as for the country as a whole. See Educational Record,
25: 369, October, 1944.


be equally appropriate for everybody, or even for any considerable number.
In view of such considerations as these, Wood has said:³


As currently used, the word standard has no place in educational literature 
outside the perorations of convention orators. . . . 

Speaking more constructively, it is sufficient to point out that educational 
standards are necessarily individual, and in their fundamental nature are akin to 
the standards of tailors and shoemakers who judge the quality of their products 
by how well they fit the individual for whom they are intended . . . and how 
long they serve him. 


Swan has satirized the idea of a single uniform standard by imagining 
what would happen if all the tailors of the country got together and agreed 
upon a “standard suit.”⁴ The distressing outcome is described⁵ as follows:

Instead of the old haphazard procedure, the standard suit was brought out when 
a man went into a tailor shop to get a new suit. If he did not fit the suit, he was 
rejected then and there. He was thus sentenced to join a nudist colony. Men soon
learned that the only thing to do was to eat the right food and take the proper 
exercise to make them just fit the suit. . . . If he perchance ate something else 
than that required to make him fit the standard suit, he would be rejected, even 
though what he ate was better for him from the standpoint of health than that 
needed to get ready for the standard. 


The important thing to remember is that, for the present at least, such 
standards do not exist in any subject. Certainly an understanding of the 
way norms are determined would make it obvious that they lay no claims 
to being goals of performance, unless perchance one is willing to accept 
mediocrity as a goal. 


B. Raw Scores and Derived Scores 


What a score means. To take a simple case, let us suppose that a 
certain pupil has made a score of 40 on a spelling test of 50 words. What 
does this score of 40 mean? To say that the score represents an achievement 
of 80 per cent is true as far as it goes, but this obvious interpretation leaves 
much to be desired. As the problem of interpreting a given score in mean- 
ingful terms is fundamental in all measurement, it deserves careful con- 
sideration. 

A score on any test is simply a numerical description of an individual’s 
performance on that test. A distinction must be made between test perform- 
ance on the one hand, and ability and capacity on the other hand. Performance
is merely evidence of ability or capacity. Ability refers to an
individual's actual achievement at the present time, whereas capacity refers 

3 Ben D. Wood, “Basic Considerations,” Review of Educational Research, 3: 13, February,
1933.

4 J. N. Swan, “Standardized Tests for Chemistry Teaching,” School and Society,
44: 275-377, August 29, 1936.

5 Ibid., page 275.


to his potentialities. Since a test is always a sampling rather than a complete 
measurement, a pupil's response to the test situation is accepted as an ex- 
pression of his ability operating under a given set of conditions. But a poor 
score on a valid achievement test is not necessarily evidence of poor ability 
in that subject under any and all conditions. It may be due to any number 
of factors, such as physical illness or discomfort, poor eyesight or hearing,
emotional disturbance, or dislike for the teacher or subject. 

In like manner, a poor performance on even the best group test of intel- 
ligence available is not necessarily positive proof of a lack of what we call 
"general intelligence.” It may be due to any one factor or combination of 
factors mentioned above as operating in the case of achievement tests. 
In addition, there are several other factors that may be responsible, such 
as poor reading ability, inability to understand the English language, and, 
especially, inadequate learning opportunities in school and outside. For 
example, Wheeler⁶ found that the average intelligence of Tennessee mountain
children, as measured by two well-known group tests, was approximately
normal at six years, but that it showed a fairly consistent decrease
with increases in chronological ages. The data warrant the significant con- 
clusion: 

The general trend of this investigation indicates that the results of both tests
are materially affected by environmental factors, and that the mountain children
are not as far below the normal as the tests seem to indicate. With the proper
environmental changes the mountain children might test near a normal group.


Ten years later Wheeler⁷ repeated the study in the same region, which
had shown “definite improvement in the economic, social, and educational
status” during the intervening period. Although there was still a tendency
for intelligence as measured by the tests to decline in the upper years, the
average IQ was ten points higher than it had been a decade earlier.

A study of Kentucky mountain children by Asher⁸ revealed similar results
and led to the conclusion that a valid comparison of the intelligence
of urban children and of children in less favorable environments “awaits 
more adequate measuring methods.” 

A group of researchers at the University of Chicago found that many 
intelligence-test items were answered correctly much more frequently by 
children from the higher socio-economic levels than by those from the lower 
ones, and they concluded that the tests were penalizing the latter youngsters
for their lack of contact with middle-class culture and lack of appreciation
for middle-class behavior.⁹ This interpretation has stirred up considerable
discussion; many psychologists do not accept the Chicago group's
conclusion that the items should be rewritten to be “fairer” to lower-class
persons.¹⁰

6 L. R. Wheeler, “The Intelligence of East Tennessee Mountain Children,” Journal
of Educational Psychology, 23: 351-370, May, 1932.

7 L. R. Wheeler, “A Comparative Study of the Intelligence of East Tennessee Mountain
Children,” Journal of Educational Psychology, 33: 321-334, May, 1942.

8 E. J. Asher, “The Inadequacy of Current Intelligence Tests for Testing Kentucky
Mountain Children,” Journal of Genetic Psychology, 46: 480-486, June, 1935.

The point is that capacity is always inferred from activity or perform- 
ance. The inference, for example, that two identical scores on an intelli- 
gence test really mean equal degrees of intelligence cannot be safely made 
unless it is known that the learning opportunities have been at least ap- 
proximately equal. A full realization of this fact would enjoin more caution 
than is often shown in the interpretation of scores on so-called tests of 
general intelligence. Trained examiners exercise care in observing rigidly 
controlled conditions for administering the tests and objective standards 
for scoring the papers, but it is often hard to be sure about the pupil’s past 
history, which may be reflected to some extent in his present performance.

Raw scores versus derived scores. When a test paper has been marked 
according to instructions, the score obtained is called a raw score or crude 
score. On tests, as distinguished from quality scales, it is often called a 
point score, since the numerical description is in terms of points. On a scale, 
as for example the Ayres handwriting scale, the numerical description is 
hardly in terms of points but rather in terms of some arbitrary value as- 
signed to a rank or position. In the example given above, the pupil has a 
point score of 40 on the spelling test. In other words, 40 describes his per- 
formance on that particular test at the time it was administered. 

But a raw or point score by itself means very little. It is usually not pos- 
sible to compare a raw score on one test directly with a raw score on another 
test. The difficulty is that the units are not comparable. The problem is 
much like that imposed when adding 3/4, 1/2, 2/3, and 1/6. It is first necessary to
find a common denominator, in this case 12, and then to express all values
in terms of that denominator. The problem is then simple:

$$\frac{9}{12} + \frac{6}{12} + \frac{8}{12} + \frac{2}{12} = \frac{25}{12} = 2\tfrac{1}{12}$$


9 The most comprehensive report of these studies is Kenneth W. Eells and others,
Intelligence and Cultural Differences, 388 pages, Chicago: University of Chicago Press,
1951. Interesting background reading is Allison Davis, Social-Class Influences upon
Learning; The Inglis Lecture, 1948, 100 pages, Cambridge, Mass.: Harvard University
Press, 1948.

10 Two essentially unfavorable reviews of Eells’ book are: John G. Darley, “Review:
Intelligence and Cultural Differences,” Journal of Applied Psychology, 36: 141-143,
April, 1952, with a reply by Eells, “Comment on Darley’s ‘Special Review,’” Journal
of Applied Psychology, 36: 422-423, December, 1952; and Quinn McNemar, “Review:
Intelligence and Cultural Differences,” Psychological Bulletin, 49: 370-371, July, 1952.
At the 1952 American Psychological Association convention in Washington, D.C.,
there was a symposium entitled “Implications of the Chicago Studies of Intelligence
and Cultural Differences” (see the American Psychologist, 7: 295, July, 1952). Speakers
were Eells, Roger T. Lennon, Irving Lorge, and John L. Stenquist. Also see “Techniques
for the Development of Unbiased Tests,” pages 76-131 in Proceedings of the 1952 Invitational
Conference on Testing Problems, Princeton, N. J.: Educational Testing Service,
1953. Papers were presented by Ernest A. Haggard of the University of Chicago,
Irving Lorge of Teachers College, Columbia University, and Phillip J. Rulon of Harvard
University, with a written commentary by McNemar to which Haggard replied.


To meet a similar need, test makers have found it necessary to determine 
common denominators for their test scores. These are called “derived 
scores." A derived score is a numerical description of a pupil's performance 
in terms of norms. The norm itself is the performance of a defined group
considered to be typical. For example, a pupil who answers correctly 22
questions on the Thorndike-McCall Reading Test has a reading ability
which is that of the normal, or average, twelve-year-old child at the end 
of the fifth grade. 

Usually, with standard tests the norms used are either age norms or 
grade norms. The derived scores merely describe the individual’s position 
in some group. Sometimes the age norms are carried one step further and 
expressed in terms of quotients; that is, one age score is divided by another. 
With the exception of quotients, most derived scores are obtained from 
tables of norms in the test manuals, which give in parallel columns the 
derived scores equivalent to various point scores. As the problems of inter- 
pretation differ somewhat for achievement and intelligence tests, they will 
be treated separately in the next two sections. 
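
Reading a derived score from a table of norms amounts to a simple lookup, as the following sketch suggests. The table is invented and belongs to no published test; the name `age_equivalent` and the rule of taking the highest tabled score not exceeding the pupil's point score are assumptions made for the illustration.

```python
from bisect import bisect_right

# Invented table of norms in parallel columns:
# (minimum point score, age equivalent written "years-months").
NORMS_TABLE = [
    (10, "8-0"),
    (15, "9-0"),
    (20, "10-0"),
    (25, "11-0"),
    (30, "12-0"),
]


def age_equivalent(point_score):
    """Return the age equivalent for a point score, or None if below the table."""
    thresholds = [minimum for minimum, _ in NORMS_TABLE]
    row = bisect_right(thresholds, point_score) - 1
    if row < 0:
        return None  # the score is lower than any score listed in the table
    return NORMS_TABLE[row][1]


print(age_equivalent(22))  # 10-0
print(age_equivalent(5))   # None
```

A published manual would normally list every point score, or give explicit directions for interpolation, so that no such rounding rule need be assumed.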


C. The Use of Norms in Interpreting Scores on 
Intelligence Tests 


Mental age versus intelligence quotient. The most commonly used 
units in which to express the results of an intelligence test are mental age 
and intelligence quotient, usually abbreviated MA and IQ.¹¹ It is important
to understand the distinction between them. MA is a measure of
mental maturity and “indicates the level of development which a child has
reached at a given time,” to use the words of Terman.¹² This degree of
mental maturity or level of development is expressed in terms of that 
“possessed by the average child of corresponding chronological age." For 
example, a point score of 75 on the Terman Group Test of Mental Ability 
is equivalent to an MA of 13 years and 2 months, usually written 13-2. 
This means that when the Terman Test had been given to hundreds of 
children in various parts of the country it was found that the average 
score of a child with a chronological age (CA) of 13 years and 2 months 
was 75 points. Any child who makes a score of 75 on this test is said to have 
an MA of 13-2. But pupils of various CA’s make scores of 75 on the Terman 
Test. It is clear, therefore, that a 10-year-old child with an MA of 13-2 
has matured rapidly, whereas a 14-year-old child with an MA of 13-2 has 
matured at a much slower rate. In other words, MA is a measure of stage
or level of maturity but not of rate. Rate is indicated by the IQ, which is 
obtained by dividing the MA by the CA and multiplying the resulting 

11 Standard-score IQ's which do not involve mental ages are used with the Wechsler 
Intelligence Scales, both adult and child. They possess certain distinct advantages over 
the traditional IQ and are popular with individual testers. 

12 Lewis M. Terman, The Intelligence of School Children, page 7. Boston: Houghton
Mifflin Company, 1919.


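decimal fraction by 100:

$$\mathrm{IQ} = 100\left(\frac{\mathrm{MA}}{\mathrm{CA}}\right).$$

In the preceding illustrations the IQ of the 10-year-old child whose MA is 13-2 would be:

$$\mathrm{IQ} = 100\left(\frac{\mathrm{MA}}{\mathrm{CA}}\right) = 100\left(\frac{13\text{-}2}{10\text{-}0}\right) = 100\left(\frac{13\tfrac{2}{12}}{10}\right) = 100\left(\frac{13.17}{10}\right) = 131.7 \approx 132.$$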
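
The same arithmetic may be sketched in Python. The function names `to_months` and `ratio_iq` are hypothetical, and ages are assumed to be written in the years-months form used in the text.

```python
def to_months(age):
    """Convert an age written 'years-months' (for example '13-2') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def ratio_iq(mental_age, chronological_age):
    """IQ = 100 * MA / CA, with both ages in months, rounded to a whole number."""
    return round(100 * to_months(mental_age) / to_months(chronological_age))


print(ratio_iq("13-2", "10-0"))  # 132: the 10-year-old has matured rapidly
print(ratio_iq("13-2", "14-0"))  # 94: the 14-year-old has matured more slowly
```

The two children have the same MA, and so the same level of maturity, but the quotients show the difference in rate.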

The IQ, then, gives us a different interpretation of a score on an intelligence
test from that afforded by the MA. The IQ is a measure of rate of
maturity, whereas the MA is a measure of level or stage of maturity. In both
cases rate and level are relative to the standardization group. If a child has 
matured rapidly, he is said to be bright; if he has matured slowly, he is said 
to be dull. A fuller interpretation is to be given a little later. For the present, 
it is sufficient to note that ordinarily both the MA and IQ of a pupil should 
be recorded, if available, for each has its distinctive values—and limita- 
tions. 

Advantages of the MA concept. The MA has certain outstanding 
values. Probably the chief of these is that it makes possible a comparison 
with achievement scores also expressed in age units, as well as with the CA 
of the pupil, so long as the derived scores are obtained for the same popula- 
tion or from comparable populations. The age basis of comparison is a 
much more stable unit than the grade location, which is greatly influenced 
by the promotion policies of the school. 

Limitations of the MA. There are also certain serious limitations of 
the MA, most of which apply particularly to the use of the concept in the
high school. It has often been pointed out that the definition of MA does
not hold true for CA’s beyond 13 or 14. One reason for this is that the norms
were based primarily upon pupils in school, who became an increasingly
select group, the weaker ones tending to drop out. This was especially true 
when Terman was standardizing the original (1916) Stanford-Binet scale, 
against which many later tests were validated. Then, too, in spite of their 
appearance, the mental age units on the scale are probably of unequal 
length, the annual increments becoming smaller and smaller as they ap- 
proach maturity, when the curve flattens out altogether. But no way has 
been devised so far for equating these units, or for making satisfactory 
allowance for their variation in length. This is the principal reason why 
true growth curves of mental development are not obtainable up to the 
present, even when the same individuals have been measured repeatedly 
over a long period of years. This also complicates the problem of inves- 
tigating the constancy of the IQ, and of its computation in the later chron- 
ological ages. Neither of these limitations, however, is very serious in the 
elementary school. 

There is a more serious limitation, which appears to operate on all age
levels: namely, that the mental age units on one test are not fully comparable
to those on another test. It is, of course, entirely possible that imperfect
standardization is largely responsible. But whatever the explanation,
it is clearly necessary in reporting intelligence test scores to indicate both
the name of the test and the form used.

It may, of course, be true that under the circumstances the terms MA 
and IQ are not particularly fortunate when used to describe the scores on 
existing tests. Be that as it may, the users of these tests should frankly 
recognize such limitations as exist. It is a curious fact, however, that people 
are just as loath to recognize the limits of their brain children as are parents 
to recognize the limits of their flesh-and-blood children. The difficulty of 
arriving at a rational interpretation of a low score on an intelligence test 
may as well be due to myopia on the part of the interpreter as to that con-
dition in the parent whose child received the score. Boynton¹⁴ recommends
what he calls a “pragmatic attitude” toward the tests, for the facts are 
that “in a vast majority of cases they work with a high degree of success." 

After all, in spite of certain definite limitations, intelligence tests, when 
intelligently used, do afford valuable information to classroom teachers and 
school administrators. So long as that is true, whether they measure intel- 
ligence or something else, whether the age score is really mental age or only 
personal age, would appear to be primarily a matter of academic interest 
only. 

One other limitation of MA and all other gross units is that by lumping 
together many elements they obscure significant differences. Two children 
of the same CA might have an MA of 8 years and yet be quite unlike. 
One child might be unusually strong in the linguistic elements of the test, 
but lacking in the more concrete, practical, or common-sense elements, 
while just the reverse might be true of the other child. This means that the 
pattern of the test responses, as well as the total or average, must be con- 
sidered. This, of course, does not mean that the total score has no value, 
but rather that by itself it is inadequate, especially for diagnosis and guid- 
ance. The practical suggestion, then, is to consider the pattern as revealed 
by the profile, as well as the total score, be that an age score or what not. 
As Thurstone says, “Each individual should be described in terms of a 
profile of mental abilities instead of by a single index of intelligence.”¹⁵

14 Paul L. Boynton, Intelligence, Its Manifestations and Measurement, pages 231-234.
New York: D. Appleton-Century Company, 1933.

15 L. L. Thurstone, “A New Concept of Intelligence and a New Method of Measuring
Primary Abilities,” Educational Record, 17: 133, Supplement No. 10, October, 1936.

The computation of the IQ. As ordinarily written, the formula for
the IQ is $\text{IQ} = 100\left(\frac{\text{MA}}{\text{CA}}\right)$. That is, the IQ is the quotient obtained by
dividing the mental age of the pupil by his chronological age at the time
the test was given. In other words, it is the percentage that the mental age
is of the chronological age. As a matter of fact, however, the CA used as
a divisor is never more than the age at which the test maker assumes mental 
maturity is reached. Upon the basis of the evidence available in 1916, 
Terman¹⁶ suggested that a divisor of 16 years be used for all pupils whose
CA is 16-0 or above. In the Revised Stanford-Binet, Terman and Merrill¹⁷
suggest this rule: 

Up to 13-0 the entire C.A. is counted; beyond 16-0, none of it. The C.A. of a
subject who is between the ages of 13-0 and 16-0 is counted as 13-0 plus 2/3 of
the additional months he has lived. This means that a true C.A. of 14 is counted
as 13-8; a true C.A. of 15 as 14-4; and a true C.A. of 16 as 15-0, which is the
highest divisor used in the computation of the I.Q.¹⁸


This suggestion would appear to be in keeping with the fact that mental 
maturity is reached gradually rather than abruptly. The age at which it is 
attained probably varies considerably from test to test. Many writers 
favor using percentiles or standard scores, rather than IQ's, especially be-
yond the elementary school.¹⁹
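
The Terman and Merrill rule quoted above can be sketched in a few lines of Python. This is only an illustrative sketch, not a routine taken from the Stanford-Binet manual: the function names and the choice to express all ages in months are assumptions made here, and the rule itself applies only to the Revised (1937) Stanford-Binet.

```python
def adjusted_ca_months(ca_months):
    """Divisor in months under the Terman-Merrill rule (hypothetical helper).

    Up to 13-0 the whole CA counts; between 13-0 and 16-0 only two thirds
    of the additional months count; beyond 16-0 nothing more is added.
    """
    thirteen, sixteen = 13 * 12, 16 * 12
    if ca_months <= thirteen:
        return float(ca_months)
    capped = min(ca_months, sixteen)  # months beyond 16-0 are ignored
    return thirteen + 2 * (capped - thirteen) / 3


def ratio_iq(ma_months, ca_months):
    """IQ as 100 times MA over the adjusted CA, both in months."""
    return 100 * ma_months / adjusted_ca_months(ca_months)


if __name__ == "__main__":
    # A true CA of 14-0 is counted as 13-8, that is, 164 months.
    print(adjusted_ca_months(14 * 12))
    # An adult (CA 20-0) with an MA of 15-0 is divided by 15-0.
    print(ratio_iq(15 * 12, 20 * 12))
```

Working in months avoids the rounding that creeps in when ages such as 13-8 are converted to decimal years.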

The actual work of computing IQ's can be greatly reduced by the use
of tables such as those in Terman and Merrill.²⁰ A preceding chapter em-
phasized the need for a careful checking of the scoring and totaling of all
scores and the obtaining of the MA or other equivalents from the tables
of norms in the manuals. There is one other step in computing the IQ, even
with the use of tables, that must be watched carefully to insure accuracy.
That is the determining of the CA of the pupil. In the lower grades this
age score should be taken from the date of birth as shown on the school
records, which in turn should be based upon a birth certificate. A young
child is likely to put down 9 when he is merely "going on 9," for example.
In the upper grades it is usually safe to rely on the pupil’s answer as given
on the test blank. On most tests he is asked to give his age at his last
birthday, and then to give the year, month, and day of his birthday, or
else to tell how many months it has been since his last birthday. The trouble
usually comes with computing the months. This computation should al-
ways be checked, preferably by a simple table prepared by the examiner.
Table 31 illustrates such a table, prepared for a test given on May 21, from
which the months can be read directly. It is desirable to verify the years
for those pupils whose birthdays come in the month the tests are given or
in the next month or so; for even high-school pupils will often make an
error of one year. When the correct MA’s and CA's are determined, the
IQ values can be read from a table. If the IQ's are computed by actual
division, it is necessary to have the work done twice independently.


TABLE 31

A TABLE FOR COMPUTING MONTHS SINCE LAST BIRTHDAY
(DATE OF TEST, MAY 21)

Birthdays Between Dates              Months Since Birthday
January 6 and February 5 ..........   4
February 6 and March 5 ............   3
March 6 and April 5 ...............   2
April 6 and May 5 .................   1
May 6 and June 5 ..................   0   (May 21, Test Date)
June 6 and July 5 .................  11
July 6 and August 5 ...............  10
August 6 and September 5 ..........   9
September 6 and October 5 .........   8
October 6 and November 5 ..........   7
November 6 and December 5 .........   6
December 6 and January 5 ..........   5


16 Lewis M. Terman, The Measurement of Intelligence, pages 140-141. Boston: Hough-
ton Mifflin Company, 1916.

17 Lewis M. Terman and Maud A. Merrill, Measuring Intelligence, page 68. Boston:
Houghton Mifflin Company, 1937.

18 In symbols the three formulas are:

\[ \text{IQ}_{(\text{CA less than 13-0})} = 100\left(\frac{\text{MA}}{\text{CA}}\right) \]

\[ \text{IQ}_{(\text{CA = 13-16})} = 100\left[\frac{\text{MA}}{13 + \frac{2}{3}(\text{CA} - 13)}\right] = \frac{150\,\text{MA}}{6.5 + \text{CA}} \]

For examinees 16 years old or older, the formula becomes

\[ \text{IQ}_{(\text{CA = 16-0 or more})} = 100\left(\frac{\text{MA}}{15}\right) = 6.67\,\text{MA} \]

These formulas apply only to the Revised (1937) Stanford-Binet Intelligence Scale,
Forms L and M. The appropriate denominators for other tests may be quite different.

19 The Wechsler Intelligence Scale for Children (WISC) has standard-score “IQ's”
for 33 age levels from 5-0 through 15-11. David Wechsler, Manual for the Wechsler
Intelligence Scale for Children, pages 27-59. New York: The Psychological Corporation,
1949.

20 Lewis M. Terman and Maud A. Merrill, op. cit., pages 417-450.
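
A table such as Table 31 can also be expressed as a small function. The sketch below is hypothetical: the function name and the `cutoff_day` parameter are assumptions introduced here. The cutoff of 6 reproduces the Table 31 groupings for a test given on May 21 (birthdays on the 1st through 5th are grouped with the preceding month); a different test date would need its own cutoff.

```python
def months_since_birthday(birth_month, birth_day, test_month=5, cutoff_day=6):
    """Months since the last birthday, read as in Table 31 (test on May 21).

    birth_month is 1-12. Birthdays that fall before cutoff_day are grouped
    with the preceding calendar month, as in the table's date ranges.
    """
    bucket = birth_month if birth_day >= cutoff_day else birth_month - 1
    return (test_month - bucket) % 12


if __name__ == "__main__":
    # A few rows of Table 31, as (month, day) of birth.
    for month, day in [(1, 6), (2, 5), (5, 21), (6, 6), (12, 6)]:
        print(month, day, months_since_birthday(month, day))
```

The function gives only the months. As the text warns, the years still need to be verified separately for birthdays that fall near the test date.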

Interpretation of the IQ. The IQ is a measure of brightness, or of rate 
of intellectual development. Following the lead of Terman, many writers 
consider IQ’s of 90 to 110 as “normal,” those below as subnormal, and 
those above as supernormal. According to this scheme, IQ's below 70 may 
indicate “feeblemindedness.” Individuals in this group are often subdivided
into three types or levels of feeblemindedness: idiots, below 25; imbeciles, 
25 to 49; and morons, 50 to 69, inclusive. Most clinicians recognize these 
as rather rough and arbitrary groupings and attempt to apply other cri- 
teria, such as social sufficiency or success in school. Feeblemindedness is a 
psychological-social-medical-legal concept. 

There is a continuous distribution from the idiot to the genius, and the 
various degrees of brightness shade into each other until they are as in- 
distinguishable as the borderline between colors of the rainbow. It is easy 
to see that red is different from violet, or to see the difference between red
and yellow; but it is hard to tell where orange leaves off and becomes red
on the one hand or yellow on the other. The concept of "genius" is worthy 
of further consideration. Following the lead of Terman, it has been common 
to interpret any IQ of 140 or above as indicating "genius or near genius." 
Evidence is accumulating which indicates that this limit is much too low. 
In an illuminating discussion of this problem, Hollingworth comes to the 
conclusion that a minimum IQ of 170 or 180 is more defensible, and that 
works of genius are conditioned by high ability when combined with zeal 
and hard work.²¹ Terman supports the conclusion that “above the IQ level
of 140, adult success is largely determined by such factors as social adjustment,
emotional stability, and drive to accomplish.”²²

Advantages of the IQ. The identification of various degrees of bright- 
ness is one of the advantages of the IQ. Moreover, many studies have 
shown the IQ to remain relatively constant under ordinary conditions from 
year to year, although radical changes in the home and school environment, 
which rarely occur, are likely to be reflected in larger changes in IQ when 
they do occur. Nemzek²³ summarized 97 studies which used the 1916
Stanford-Binet test and 27 studies which used group tests. The median 
correlation coefficient by the test and retest method was .832 for the indi- 
vidual test and .846 for the group tests. The corresponding range of the 
middle 50 per cent of the coefficients was .760 to .889 and .779 to .885, 
respectively. The similarity of results for the individual and group tests is 
remarkable. But these correlations permit considerable variations, which 
apparently are more likely to occur at the extremes of the distribution
than near the center.²⁴ There seems to be a tendency for the lower IQ's to
decrease somewhat on later tests, while the evidence for the higher IQ's is 
somewhat contradictory. After six years Terman found that 73 of his 
"geniuses," still below a CA of 13, had lost in Stanford-Binet IQ, the boys 
3 points and the girls 13 points on the average. On the other hand, Cattell 
at Harvard found that children with IQ's above 120 gained approximately 
8 points in three to six years. Most studies have noted a regressive effect, 
however. But, of course, most cases tend to cluster rather closely about
the center of the distribution, where the IQ is “fairly stable.” After a sum-
mary of the experimental evidence, Cattell arrives at this practical con-
clusion:²⁵

The results are reported as evidence of the large changes in the IQ which do
occur in ordinary school practice and to emphasize the caution with which the
results of a single intelligence test must be interpreted, even though it be an
individual examination made by an expert.


21 Leta S. Hollingworth, Children over 180 IQ, Stanford-Binet. Origin and Development.
Yonkers-on-Hudson, N. Y.: World Book Company, 1942.

22 Thirty-Ninth Yearbook of the National Society for the Study of Education, Part I,
page 84. Bloomington, Illinois: Public School Publishing Company, 1940. Quoted by
permission of the Society.

23 Claude L. Nemzek, “The Constancy of the I.Q.,” Psychological Bulletin, 30: 154,
February, 1933. A later review is: Robert L. Thorndike, “‘Constancy’ of the IQ,”
Psychological Bulletin, 37: 167-186, March, 1940. For a claim that IQ’s are extremely
stable from the first grade to the freshman year of college see G. L. Brown, “On the
Constancy of the I.Q.,” Journal of Educational Research, 44: 151-153, October, 1950.
A critical rejoinder is Julian C. Stanley, “A Note Concerning Brown’s ‘On the Constancy
of the I.Q.,’” which is scheduled to appear in the Journal of Educational Research.

24 Both statistical evidence and clinical experience seem to agree that the Revised
Stanford-Binet is particularly reliable at the lower IQ levels. See Quinn McNemar, The
Revision of the Stanford-Binet Scale, page 13. Boston: Houghton Mifflin Company, 1942.


Since the IQ of the average pupil is likely to be relatively stable, if orig- 
inally computed for CA’s between 7 and 13 years, his MA at a later age 
can be estimated with fair assurance from his present CA and his IQ. For 
example, suppose that a pupil who had an IQ of 95 when in the third grade 
is now in the fifth grade. His present CA is 10-2, or 10.17 years. His esti-
mated MA in years is 10.17 × .95, which is 9.66, or 9 8/12 = 9-8.
Comparisons with achievement test scores expressed in ages can, therefore, 
be made wherever such comparisons are thought desirable, without the 
necessity of repeating the intelligence tests at the same time. Although
it is desirable to repeat intelligence tests until at least three tests have been
given during the pupil’s educational career, the tests need not be given
at the same time as the achievement tests in order to make comparisons.
The IQ from a test given by an expert to pupils in the public school may be
regarded as sufficiently constant to make adjustments for a different date 
fairly safe, at least for a period of two or three years. 
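
The estimate in the example above amounts to multiplying the present CA by the earlier IQ. A minimal sketch follows, assuming ages are carried in months and the result is rounded to the nearest month; the function name is hypothetical and the rounding convention is an assumption made here.

```python
def estimated_ma(ca_years, ca_months, iq):
    """Estimate present MA as (years, months) from present CA and an earlier IQ."""
    total_months = round((ca_years * 12 + ca_months) * iq / 100)
    return divmod(total_months, 12)


if __name__ == "__main__":
    # The fifth-grade pupil of the text: CA 10-2, earlier IQ 95.
    years, months = estimated_ma(10, 2, 95)
    print(f"{years}-{months}")
```

For the pupil in the text this gives 9-8, in agreement with the worked figure.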

Limitations of the IQ. In common with all units in which test scores 
are expressed, the IQ suffers from two limitations. The zero point is arbi- 
trary rather than real, and the various units are of unequal length or value. 
The difference between 60 and 70 is not equal to the difference between 
90 and 100, or to the difference between 120 and 130. In the same way it is 
absurd to say that a pupil whose IQ is 120 is twice as bright as one whose 
IQ is 60, or half again as bright as one whose IQ is 80. But in this regard 
the IQ is no worse than are all raw test scores and practically all other 
derived scores. For example, it is obvious that when the thermometer reg- 
isters 10 degrees below zero it is not twice as cold as when the thermometer 
registers 5 degrees below. 

There is also another serious limitation of the IQ. Many studies have
shown that the IQ’s on one test are not comparable to those obtained on
another test. This was shown clearly in Table 24 on page 110, where the
mean IQ's on five intelligence tests ranged from 96.4 to 118.2 for the same
284 high-school seniors. The ranges of individual scores on the five tests
are not reported, but it is safe to infer that some of them must have been
quite large. IQ’s from different tests must be equated in order to make
them comparable, for even when average IQ's on different tests are close
together the extremes are likely to vary widely. One solution which has
been proposed is to equate all tests in terms of some widely used test.²⁷
Another procedure is to equate several tests locally. This was done by
Heil and Horn, as shown in Table 32, for the 284 second-semester twelfth-
grade students in a Los Angeles, California, high school. Their five intelli-
gence tests were the Otis Self-Administering Test of Mental Ability, Higher
Examination, Form A; the California Short-Form Test of Mental Ma-
turity, 1942 edition; the Terman-McNemar Test of Mental Ability; the
Science Research Associates Primary Mental Ability tests; and the SRA
Non-Verbal form. The bold-face IQ's going from 90 to 125 on each side
of Table 32 are corrected values. Note, for instance, in the 117 row that
actual IQ's on the various tests range from 107 to 144. At 125 they run
from a low of 117 to a high of 151.


TABLE 32

TABLE FOR EQUATING INTELLIGENCE QUOTIENT VALUES²⁶

(Use only with cases which were sixteen years
of age or over when tested.)

Corrected | Otis      | CTMM*     | CTMM*     | CTMM*     | Terman-   | SRA       | SRA       | Corrected
IQ Value  | Higher,   | Total     | Language  | Non-      | McNemar   | PMA       | Non-      | IQ Value
          | Form A    |           |           | Language  |           |           | Verbal    |

90        | 88        | 96        | 87-88     | 96        | 87        | 75        | 97        | 90
91        | 89        | 97        | 89-90     | 97        | 88        | 76        | 98        | 91
92        | 90        | 98        | 91-92     |           | 89        | 77        | 99        | 92
93        | 91        | 99        | 93-94     | 98        | 90        | 78        | 100       | 93
94        |           | 100       | 95-96     |           | 91        | 79        | 101       | 94
95        | 92        | 101       | 97-98     | 99        | 92        | 80        | 102       | 95
96        | 93        | 102       | 99-101    |           | 93        | 81        | 103       | 96
97        | 94        | 103       | 102-105   | 100       | 94        | 82-83     | 104       | 97
98        | 95        | 104       | 106-108   |           | 95        | 84        | 105       | 98
99        | 96        | 105       | 109-110   | 101-102   | 96        | 85-87     | 106-107   | 99
100       | 97        | 106       | 111-112   | 103-104   | 97        | 88        | 108-109   | 100
101       | 98        | 107       | 113       | 105       | 98        | 89        | 110-111   | 101
102       | 99        | 108       | 114-116   | 106       | 99        | 90        | 112       | 102
103       | 100       | 109       | 117-118   | 107       | 100       | 91        | 113       | 103
104       | 101       | 110-111   | 119       |           | 101       | 92        |           | 104
105       | 102       | 112       | 120       | 108       | 102       | 93        | 114       | 105
106       | 103       | 113       | 121-122   | 109       | 103       | 94        | 115       | 106
107       |           | 114       | 123       |           | 104       | 95        | 116       | 107
108       | 104       | 115       | 124       | 110       |           |           |           | 108
109       | 105       | 116       | 125-126   | 111       | 105       | 96        | 117       | 109
110       | 106       | 117       | 127-128   | 112-113   | 106-107   | 97        | 118-119   | 110
111       | 107       | 118       | 129-130   | 114       | 108       | 98        | 120       | 111
112       | 108       | 119       | 131-133   | 115       | 109       | 99        | 121-122   | 112
113       | 109       | 120       | 134-136   | 116-117   | 110       | 100-101   | 123-124   | 113
114       | 110       | 121       | 137-139   | 118       | 111       | 102-103   | 125-126   | 114
115       | 111       | 122       | 140-141   | 119       | 112       | 104       | 127       | 115
116       | 112       | 123       | 142-143   | 120       | 113-114   | 105-106   | 128       | 116
117       | 113       | 124       | 144       | 121       | 115       | 107       | 129       | 117
118       | 114       | 125       | 145       |           | 116       | 108       | 130       | 118
119       | 115       | 126       | 146       | 122       | 117       | 109       | 131       | 119
120       | 116       | 127-128   | 147       | 123-124   | 118       | 110-111   | 132-133   | 120
121       | 117       | 129       | 148       | 125-126   | 119       | 112-113   | 134-135   | 121
122       | 118       | 130       | 149       | 127-128   | 120       | 114       | 136-137   | 122
123       |           | 131       |           | 129       | 121       | 115       | 138-139   | 123
124       | 119       | 132       | 150       | 130       | 122       | 116       | 140       | 124
125       | 120       | 133       | 151       | 131       | 123       | 117       | 141-142   | 125

* These values apply only to the California Test of Mental Maturity, Short Form,
1942 edition.


25 Psyche Cattell, “Stanford-Binet IQ Variations,” School and Society, 45: 615-618,
May 1, 1937.

26 This is a slight modification of Table XIV in an unpublished report by Walter б.
Heil and Alice Horn, “A Comparative Study of the Data for Five Different Intelligence
Tests Administered to 284 Twelfth Grade Students at South Gate High School,
Los Angeles.” Los Angeles: Curriculum Division, Los Angeles City School Districts,
February, 1950. (Mimeographed.)

Lennon²⁸ has provided tables yielding equivalent scores and equivalent
IQ's for the Otis Quick-Scoring Mental Ability Tests: Gamma Test, Form 
Am or Bm; Pintner General Ability Tests: Verbal Series, Advanced Test, 
Form A or B; and Terman-McNemar Test of Mental Ability, Form C 
or D. An IQ of 100 on the Otis is equivalent to 99 on the Pintner and 102 
on the Terman-McNemar. Corresponding to an Otis IQ of 130 is a Pintner 
IQ of 134 and a Terman-McNemar IQ of 138. At the lowest IQ level shown
in Lennon's Table III are the following figures: Otis, 76; Pintner, 64; and 
Terman-McNemar, 66. Therefore, among these three tests issued by the 
same publisher discrepancies at the “average” IQ of 100 are slight, but 
differences may be serious for low or high scorers unless the recommended 
conversion is made. The Heil-Horn and Lennon studies point up the fact
that an individual may have as many IQ's as there are different intelligence 
tests. 

Doubtless the fundamental solution is for all test makers to standardize 
their tests, whether they aim to measure intelligence or achievement, on a 
national population so chosen as to conform fully to the mathematical 
theory of sampling. As long as tests continue to be standardized on samples 
chosen primarily upon the basis of convenience, even when they involve 
large numbers and wide geographical areas, there is still no assurance that 
the samples are truly representative and thus comparable with each other. 

Because of its numerous limitations some authorities would abandon the
IQ concept altogether. Stoddard,²⁹ for example, characterizes it as a “myth”
pure and simple. No one recognizes the limitations of the IQ more clearly
than its friends, as this statement from Terman³⁰ indicates: “An obtained
I.Q. is not only subject to chance errors resulting from inadequate sampling
of abilities, but also of numerous other errors, including practice effects,
negativism or shyness, the personal equation of the examiner, and stand-
ardization errors in the test used.”


27 W. S. Miller, “Variation of IQ's Obtained from Group Tests,” Journal of Educa-
tional Psychology, 24: 468-474, September, 1933.

28 Roger T. Lennon, “A Comparison of Results of Three Intelligence Tests,” Test
Service Notebook No. 11. Yonkers, N. Y.: World Book Company, 1951. 4 pages.

29 George D. Stoddard, The Meaning of Intelligence, page 258. New York: The Mac-
millan Company, 1943.

30 Thirty-Ninth Yearbook of National Society for the Study of Education, op. cit.,
page 466.

All things considered, the authors are disposed to agree with Terman and 
Merrill³¹ that the sensible thing to do under the circumstances is to “employ
the simplest indices available and as rapidly as possible acquaint teacher,
school counselors, social workers, and physicians with their significance and
their limitation.” The MA and the IQ are examples of such “simple in-
dices.” However, amateur test users will do well to remember at all times
Hildreth's warning that “no one I.Q. ever indicates exactly any child's
tested ability.”³² No matter how obtained, the IQ should never be accepted
as the final verdict, but rather as a point of departure for further investi- 
gation. 

Other derived scores. To avoid the difficulties in the MA and IQ, 
other types of derived scores have been proposed. Of these, the three most 
common will be considered briefly. It must be apparent at the outset, how- 
ever, that no norm can be any better than the sample upon which it is based 
or the measuring instrument employed. Errors of sampling and of meas- 
urement cannot be avoided by the simple device of shifting the unit in 
which to express the norms. The Cooperative tests have moved in this 
direction by taking as a point of reference the “50-point,” the score “in-
tended to represent the score which the average white child in the United
States would make at the end of the particular course if he had attended
a typical school and had had the usual instruction in the subject in ques-
tion.”³³

The Personal Constant (PC) has been suggested by H. Heinis as a sub- 
stitute for the IQ. Kuhlmann was so convinced of the merits of this method 
that he included a table of Heinis Mental Growth Units, which he recom- 
mended in place of the IQ, for use with the Kuhlmann-Anderson tests. 
On the other hand, Cattell³⁴ found the PC more constant than the IQ for
pupils of low intelligence but not for those of high intelligence. The PC has 
not received wide acceptance. 

A second substitute, proposed by many writers, is the percentile rank. 
A percentile rank is a description of a pupil's position in a typical age or 
grade group in terms of the percentage of pupils who fall below that score. 
A percentile rank of 50 would, of course, be exactly at the median. In like
manner a percentile rank of 10 would show that in a typical group only 10
per cent make a poorer score than that, while a percentile rank of 90 would
mean that only 10 per cent make better scores than that, since 90 per cent
fall below. This is a very simple and useful system that is widely used for
achievement tests also, but it has two limitations. One is that a percentile
rank of a given magnitude in one group is not directly comparable with
the same percentile rank in another group. A 10th-percentile pupil in the
freshman class is manifestly not the same as a 10th-percentile pupil in the
senior class, for example. A second limitation is that the percentile rank
units are of unequal length. For example, in a typical group an IQ of 62 has
a 1st-percentile rank and an IQ of 68, which is an increase of 6 points, has a
3rd-percentile rank; but another IQ change of 6 points, from 85 to 91,
raises the percentile rank 11 points. In other words, the distances between
percentiles near the center of the group are much less than those at the
extremes.


31 Lewis M. Terman and Maud A. Merrill, op. cit., page 29.

32 Gertrude Hildreth, “Stanford Binet Retests of Gifted Children,” Journal of Educa-
tional Research, 37: 301, December, 1943.

33 John C. Flanagan, The Cooperative Achievement Tests, A Bulletin Reporting the
Basic Principles and Procedures Used in the Development of Their System of Scaled
Scores, page 19. New York: Cooperative Test Service, 1939.

34 Psyche Cattell, “The Heinis Personal Constant as a Substitute for the IQ,” Journal
of Educational Psychology, 24: 221-228, March, 1933.
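
The definition of a percentile rank given above, the percentage of pupils in the group who fall below a given score, can be sketched directly. The function name and the group of scores are hypothetical, invented only for illustration. Some published norms also count half of the pupils tied at the score, which this sketch does not do.

```python
def percentile_rank(group_scores, score):
    """Percentage of pupils in the group whose scores fall below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


if __name__ == "__main__":
    # Hypothetical raw scores for a group of ten pupils.
    group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
    print(percentile_rank(group, 25))
```

Because the result depends entirely on the group supplied, the same raw score will generally receive a different percentile rank in a freshman group than in a senior group, which is the first limitation noted in the text.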

A third procedure is the method of standard scores, sometimes called 
sigma scores or z-scores, used by Stutsman in her Merrill-Palmer perform-
ance scale and by Wechsler in his scales. These units are expressed in terms
of the mean and standard deviation of the typical age or grade group or, 
for that matter, of any group. An illustration will help to make the system 
clear. Suppose that a pupil makes 40 points on one test and 80 points on 
another test. It is clearly unsafe to say that he did better on one test than 
on the other. This is evident if we find that the mean score on the first test 
is 30 points and that the mean score on the second is 90 points. In other 
words, the pupil is 10 points above average on one test and 10 points below 
average on the other test. To reduce these scores to a common denominator 
requires one additional step: namely, to take into consideration the vari- 
ability of the scores as well as their central tendency. Suppose, then, that 
the standard deviation of the first is 10 points and that of the second test 
is 20 points. It is now clear that our pupil is 1.0 standard deviation distance 
above the mean on one test and .5 standard deviation distance below the 
mean on the other. These two figures are standard or sigma scores and are 
written +1.0σ and −.5σ. To avoid negative numbers, the suggestion is
sometimes made that the mean score be called 50 arbitrarily and each
standard deviation distance above and below be equivalent to 10 points.
In our illustration opposite, the pupil's scores would be 50 + (1 × 10)
= 60, and 50 − (0.5 × 10) = 50 − 5 = 45.
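
The arithmetic of the illustration can be checked with a short sketch. The function names are hypothetical; the means, standard deviations, and raw scores are those of the example in the text.

```python
def sigma_score(raw, mean, sd):
    """Standard (sigma) score: distance from the mean in standard deviations."""
    return (raw - mean) / sd


def scaled_score(z, new_mean=50, new_sd=10):
    """Re-express a sigma score on a scale with mean 50 and sigma 10."""
    return new_mean + z * new_sd


if __name__ == "__main__":
    first = sigma_score(40, mean=30, sd=10)    # first test in the text
    second = sigma_score(80, mean=90, sd=20)   # second test in the text
    print(first, scaled_score(first))
    print(second, scaled_score(second))
```

The larger raw score of 80 turns out to be the poorer performance once the mean and variability of each test are taken into account.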

The system has much to commend it statistically. In fact, its principal 
limitation is that it appears to be rather cumbersome to handle. That im-
pression is, however, probably due more to its unfamiliarity than to any-
thing else. Some writers³⁵ not only point out that MA's and IQ's defined
in the usual fashion are indeterminate for the upper half of the adult popula-
tion, but also argue that standard scores or percentile scores yield
much more information even for young children. Figure 35 shows the rela-
tion between standard scores, percentile scores, and Revised Stanford-
Binet IQ's. It will be noted that the IQ’s on the Revised Stanford-Binet
may be considered roughly as standard scores whose mean is 100 and whose
sigma is 16.


S.S.  -3.0σ  -2.5σ  -2.0σ  -1.5σ  -1.0σ  -0.5σ    M    +0.5σ  +1.0σ  +1.5σ  +2.0σ  +2.5σ  +3.0σ
P.R.    0.1    0.6    2.3    6.7   15.9   30.9   50.0   69.1   84.1   93.3   97.7   99.4   99.9
I.Q.     52     60     68     76     84     92    100    108    116    124    132    140    148


Figure 35. The Relation Between Standard Scores, Percentile Ranks, and Revised
Stanford-Binet IQ's. (Based on Terman and Merrill, Measuring Intelligence, page 42.)


35 L. L. Thurstone and Thelma Gwinn Thurstone, “Psychological Examinations, 1940
Norms,” American Council on Education Studies, 5: 2-3, 1941.
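
The three scales of Figure 35 can be related numerically if the IQ's are treated, as the text says they roughly may be, as normally distributed standard scores with mean 100 and sigma 16. The sketch below rests on that assumption; the function names are hypothetical, and `statistics.NormalDist` is part of the Python standard library (version 3.8 and later).

```python
from statistics import NormalDist

# Assumption from the text: Revised Stanford-Binet IQ's behave roughly
# like standard scores with mean 100 and sigma 16.
IQ_CURVE = NormalDist(mu=100, sigma=16)


def iq_to_sigma_score(iq):
    """Standard score corresponding to an IQ under the assumed curve."""
    return (iq - IQ_CURVE.mean) / IQ_CURVE.stdev


def iq_to_percentile_rank(iq):
    """Percentage of the assumed normal population falling below the IQ."""
    return 100 * IQ_CURVE.cdf(iq)


if __name__ == "__main__":
    for iq in (68, 84, 100, 116, 132):
        print(iq, iq_to_sigma_score(iq), round(iq_to_percentile_rank(iq), 1))
    # Equal IQ steps cover unequal percentile distances:
    print(iq_to_percentile_rank(68) - iq_to_percentile_rank(62))
    print(iq_to_percentile_rank(91) - iq_to_percentile_rank(85))
```

The last two lines illustrate the unequal length of percentile units: a 6-point IQ change near the center of the distribution moves the percentile rank far more than the same change at the low extreme.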


Regardless of the type of norms used, the teacher must never lose sight 
of the fact that all measurement is subject to error and that scores can 
rarely be taken at face value. Some persons are so impressed by the “ubiq- 
uitous probable error,” to use Kelley’s phrase, that they think numerical 
scores of every kind "convey an unwarranted impression of exactitude,” 
and would report the results of intelligence tests in general terms, such as 
“dull,” “normal,” or “bright.”³⁶ In the writers’ judgment, a better prac-
tice is to continue to employ the numerical scores but to be keenly aware
of their limitations. 


D. The Use of Norms in Interpreting Scores on 
Achievement Tests 


Educational age versus educational quotient. In interpreting scores 
on achievement tests the terms educational age and educational quotient are 
sometimes used in just the same way that mental age and intelligence quo- 
tient are used in interpreting scores on intelligence tests. In other words, 
educational age, or EA, is a measure of educational maturity, or level or 
stage of educational growth. In like manner the educational quotient, or EQ, 
is a measure of rate of educational growth or development. The EQ is
obtained by dividing the EA by the CA. For example, a 10-year-old boy
has made a score of 60 points on a certain achievement test, which is the
average score for a 12-year-old pupil. The boy is then said to have an EA
of 12-0, which, divided by 10-0, gives him an EQ of 120. In like manner
another 10-year-old boy in the same class might make a score of 35 points,
which is the average score for a pupil of 8 years and 6 months. His EA is
8-6, and his EQ is 85. It should be noted that the terms EA and EQ refer
to scores made on general achievement tests or on test batteries involving
several subjects. If a test in only one subject is used, the terms subject age
and subject quotient are employed. For example, a reading test would yield
reading ages and reading quotients, while an arithmetic test would yield
arithmetic ages and arithmetic quotients, and so on for the other subjects.


36 H. E. Garrett, “The Standardization of the Terman-Merrill Revision of the Stanford-
Binet Scale,” Psychological Bulletin, 40: 196, March, 1943. Also see Psychological Bulletin,
43: 72-76, January, 1946.
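
The two EQ examples above can be reproduced with a short sketch. As the figures of 120 and 85 imply, the quotient is multiplied by 100. The function names and the "years-months" string format are assumptions made here for illustration.

```python
def age_to_months(age):
    """Convert an age written as 'years-months' (e.g. '8-6') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def educational_quotient(ea, ca):
    """EQ = 100 x EA / CA, rounded to the nearest whole number."""
    return round(100 * age_to_months(ea) / age_to_months(ca))


if __name__ == "__main__":
    print(educational_quotient("12-0", "10-0"))  # first boy in the text
    print(educational_quotient("8-6", "10-0"))   # second boy in the text
```

The same function serves for subject quotients if a reading age or arithmetic age is supplied in place of the EA.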

Uses of EA. The value of EA and of the various subject ages is that 
they make possible a meaningful interpretation of scores in terms of a 
relatively stable unit, chronological age. They also facilitate important 
comparisons with norms, on both intelligence and other achievement tests, 
whenever they have been standardized on comparable groups, as well as 
with the individual’s own MA and CA. 

Limitations of EA. EA and all subject ages have many of the limita- 
tions already pointed out in the case of the MA. Probably the most serious 
is that they reflect the promotion policies and holding power of the schools 
in which the tests are given. It is a matter of common observation that the 
performance of a 10-year-old pupil who is retarded in the grade is not the 
same as that of a 10-year-old pupil who has made normal progress, and
much less than that of the accelerated pupil of the same age. Crawford³⁷
has made an extensive study of the influence of such factors upon norms 
based on unselected groups, and comes to this significant conclusion: “The 
factors of chronological age, mental age, and rate of progress affect test 
norms to a degree that makes the use of norms based on groups in which 
these are not controlled of doubtful value." His recommendation is that 
both CA and MA should be used in establishing norms. One solution to the 
problem is to use only pupils whose CA's are normal for their respective 
grades in computing norms. 

One other limitation of existing age norms is that age units on one test 
are not comparable to those on other tests that are presumably measuring 
the same thing. Test publishers owe a service to the public and should 
prepare tables for equating age norms on various achievement tests in much 
the same way as has been done by the Cooperative Achievement Tests³⁸
and the various parts and forms of the Metropolitan and Stanford Achieve-
ment Tests.³⁹ Another less serious limitation is that the age units on
any one scale or test are not necessarily equivalent throughout its length.


37 John B. Crawford, “Age and Progress Factors in Test Norms,” University of Iowa
Studies in Education, 9: 1-39, June 15, 1934.

38 Published by the Cooperative Test Division of Educational Testing Service.

39 Published by World Book Company.

An important limitation of the EA and all other gross units is that they 
lump together many and diverse elements in such a way as often to obscure 
significant differences. Two pupils of the same CA or MA might have an 
EA of 10-0, for example. This does not guarantee that they are by any 
means identical in achievement. One pupil might be greatly accelerated 
in reading, language, and literature, but retarded in arithmetic, spelling, 
and science, whereas the exact opposite might be true of the other pupil. 
EA, which is a composite, or average, has taken no account of the pattern, 
which may afford the key to an adequate interpretation. This fact, of 
course, does not mean that age scores and other averages have no value, 
but rather indicates that they are inadequate by themselves. The practical 
implication is clear. The total score, whether age or what not, is important,
but must be considered always in relation to the pattern of the responses,
usually best represented as a profile. Figure 36, showing two profiles drawn
upon the same chart for pupils making the same total score, should make
this point clear.

[Figure 36 chart: two pupils' profiles plotted against Chronological Age, Mental Age, Achievement Average, and subtests including Reading Accuracy, Health Knowledge, Language Usage, History and Civics, Geography, and Elementary Science. Chart notes: Represent Grade Standard by vertical line. Period between years and months indicates Grade Score; dash indicates Age Score.]

Figure 36. The Profiles of Two Pupils Who Made the Same Total Score on a General
Achievement Test. (From the Modern School Achievement Tests, published by
Bureau of Publications, Teachers College, Columbia University.)

Use and limitations of EQ. The method of computing the EQ and 
the various subject quotients has already been described. As measures of 
rate of educational progress these quotients may be useful in the interpreta-
tion of scores on achievement tests. However, no such elaborate scheme
for the interpretation of these quotients as was described for the interpre-
tation of quotients on intelligence tests has been worked out. There is
nothing which corresponds to such terms as “feeblemindedness” or “genius.”

EQ's are subject to some, but not to all, of the limitations pointed out 
for IQ's. Since educational growth continues at least throughout the formal 
school period, there is no problem of selecting a maximum divisor, such as 
was described in the case of IQ. It is true that the units are of unequal
length and that the quotient technique is not appropriate for use in high 
school, where age norms on achievement tests are ordinarily not available. 
Undoubtedly the most serious limitation of quotients, as well as of other 
norms, is that the units on one test are not directly comparable with those 
on another test that purports to be measuring the same thing. Tables for 
equating quotients on achievement tests similar to those for intelligence 
tests have not been published. Better still would be the exercise of greater 
care in the original standardization. The test record should always indicate 
the name of the test and the form used, for achievement tests as well as 
for intelligence tests. 

Use and limitations of grade norms. It is a very common practice 
to interpret achievement tests in terms of grade norms. Grade norms on 
standard tests are usually the average scores made on the test by pupils 
in each grade when the test has been given to pupils in widely scattered 
areas. In the earlier tests these grade norms were usually for the end of the 
grade only, although sometimes for the middle of the grade also. This made 
comparison with norms somewhat difficult, unless the tests were given 
at the same time in the year. Figure 31, on page 267, illustrates a simple 
graphical method of making comparisons with norms for a different time 
in the year. Of course, such a comparison assumes uniform progress 
throughout the grade, which may be only approximately true in some 
instances. A slight variation of such norms for high-school use is to base 
the norms upon the length of time the subject has been studied rather than 
upon the grade or year in which it happens to be offered. Since many high- 
school subjects do not continue over the entire high-school period and have 
no definite grade location, norms based on the number of semesters the 
subject has been studied are very useful. The problem of interpretation is 
just the same as for regular grade norms. 
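A minimal sketch of such a comparison, assuming (as the paragraph above does) uniform progress throughout the grade. The function name and the norm values are hypothetical and are not taken from any published test; the sketch only shows that the comparison amounts to linear interpolation between two published norm points.

```python
def interpolate_norm(norm_at_start, norm_at_end, fraction_elapsed):
    """Estimate the norm partway between two published norm points.

    Assumes progress is uniform between the two points, which may be
    only approximately true in practice.
    """
    if not 0.0 <= fraction_elapsed <= 1.0:
        raise ValueError("fraction_elapsed must lie between 0 and 1")
    return norm_at_start + fraction_elapsed * (norm_at_end - norm_at_start)


# Hypothetical end-of-grade norms (raw scores) for grades 4 and 5.
end_of_grade_4 = 20.0
end_of_grade_5 = 30.0

# A class tested halfway through grade 5 is compared with the midpoint.
midyear_norm = interpolate_norm(end_of_grade_4, end_of_grade_5, 0.5)
print(midyear_norm)
```

The same function would serve for norms based on semesters of study, with the fraction measured between two semester norms instead of two grade norms.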

In recent years many tests below the senior high-school level have norms
available for every month in the school year. For example, 6.0 means the 
norm for the beginning of the sixth grade, while 6.5 is the norm for the 
middle of the grade. In like manner 4.2 means the norm for the fourth 
grade two months after school starts, and 4.10 means the norm for the end 
of the fourth grade. Such norms are often called G-scores, and sometimes 
B-scores. They have the distinct advantage of being readily understood.
They also have certain dangers and limitations. For one thing, they tend 
to imply a degree of mathematical exactness which the accuracy of existing 
tests hardly warrants. Certainly it is unsafe to take them literally at their
face value. A still more serious limitation is the lack of comparability of
scores on different tests. Adams found,⁴⁰ for example, that eight arithmetic
tests rated the mean performance of 152 pupils all the way from the fifth 
grade to the eleventh grade, depending upon the test used. It is unnecessary 
to comment upon the absurdity of fractional-grade norms in a situation
like that. The solution is, however, not so much in the abandonment of 
grade norms as in their further refinement. There is another danger in in- 
terpreting grade norms, no matter how accurately determined. This danger 
arises partly from the fact that a school with an overstrict promotion policy 
will tend to show up favorably on grade norms simply because of the pres- 
ence of a great many pupils in the several grades who really belong in 
higher grades. It is always well, therefore, in any apparently superior 
school, to make a comparison on the basis of age, to see whether the supe- 
riority is real or only illusory. Of course there would be little difference 
between schools which promote strictly on the basis of CA and those in 
which the percentages of acceleration and retardation are balanced, a con- 
dition which rarely exists. 

Ruch and Segel, after noting some evidence that “recent tests may pos-
sibly have much more dependable norms than those standardized a decade
or so earlier,” nevertheless make the suggestion that⁴¹


. . . many factors peculiar to the individual school system must be considered
in the interpretation of tests, such as: the legal age of school entrance, the actual 
average age of school entrance, rates of acceleration and retardation, rates of 
elimination from school, percents of failures of pupils, genuine differences in instruc- 
tional efficiency, and variations in average mental and educational capacity from 
school to school. 


Harris⁴² has called attention to a rather common error made in the in-
terpretation of grade norms at the primary level. It arises from the failure
to take into account the fact that zero performance on an achievement
test is 1.0. A first-grade class whose grade score at the end of the year is 
2.0 has made only normal progress for the year. 

Other norms for achievement tests. Several other types of norms 
are used, some of which require brief mention. Of these, doubtless the most 
important are percentile norms. As in the case of intelligence tests already 
discussed, such norms interpret a pupil's score by describing his position 
in the group in terms of the per cent of pupils who fall below the score made. 
Generally all percentile ranks from 0 through 99, but sometimes only cer-
tain points, such as the 25th, 50th, and 75th, are given. These percentile
ranks are very easy to interpret, but they have two limitations, neither
of which is usually very serious for most purposes. One is that the scale
values are unequal in length, and the other is that percentile values in one
grade or age group are not directly comparable with those in another.

40 Summarized by Giles M. Ruch from an unpublished master's thesis by Eunice
Adams, The Comparative Reliability of Eight Arithmetic Tests, University of California,
1929, in Review of Educational Research, 3: 39, February, 1933.

41 Giles M. Ruch and David Segel, Minimum Essentials of the Individual Inventory in
Guidance, page 82. Washington: United States Office of Education, 1939.

42 Albert J. Harris, “Note on a Source of Error in Interpreting Grade Norms,”
Journal of Educational Research, 39: 151-153, October, 1945.

Standard scores are also used. They are interpreted in the same manner 
as are similar scores on intelligence tests. These have already been dis-
cussed. McCall has proposed a modification called a T-score based upon a
standard group composed of 12-year-old pupils. All age and grade groups 
are described by locating their T-score position in this 12-year-old group. 
The mean is given a value of 50, and each standard deviation distance 
above and below is divided into tenths, each counting one point. For ex- 
ample, a 15-year-old pupil makes a reading score on the Thorndike-McCall 
Reading Test, which, according to the table of norms, is a T-score of 60. 
In other words, this pupil is located 1σ distance above the mean of typical
12-year-old pupils who have taken the test. The T-score technique is some- 
times used with other age groups. The principal limitations of the T-score 
are that it is not well adapted to high-school tests and that it is rather 
cumbersome even for grade-school tests. 
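The arithmetic described above (mean set at 50, each standard deviation divided into ten points) can be sketched as a simple linear conversion. This is only an illustration under that stated description: the function name and the reference-group figures are hypothetical, and published T-score tables may have been derived by a different procedure.

```python
def t_score(raw_score, reference_mean, reference_sd):
    """Convert a raw score to a T-score relative to a reference group.

    The reference group here stands for the 12-year-old standard group;
    its mean maps to 50 and each standard deviation counts 10 points.
    """
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return 50 + 10 * (raw_score - reference_mean) / reference_sd


# Hypothetical reference-group statistics for 12-year-old pupils.
mean_12_year_olds = 30.0
sd_12_year_olds = 10.0

# A pupil of any age scoring one standard deviation above that mean.
print(t_score(40.0, mean_12_year_olds, sd_12_year_olds))
```

A pupil whose raw score sits one standard deviation above the reference mean receives 60, which matches the reading-test example in the text.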

Value of local norms. Practically all norms published on tests are 
so-called national norms. When such norms are carefully derived, they are 
of great value in interpreting the scores. It is easy to overemphasize their 
value for the ordinary school and school system, however. They must never 
be taken as standards. There are such wide variations in the length of 
school terms, in the equipment of schools, in the training and experience 
of teachers, and in other important respects among the several states and 
among the school units of any one state as to make any single series of 
norms for the whole nation inadequate. National norms must be supple- 
mented by norms for the state, county, and city school systems, and even 
for the individual school. What is really important in most cases is the 
comparison of grades, classes, and schools which operate under approxi-
mately the same conditions. Lindquist⁴³ has pointed out several distinct
advantages of the regional testing programs used in Iowa for a number of
years.

For purposes of classification, what is needed is a set of norms for the 
school itself. To derive satisfactory local norms, all that is required is to 
combine all pupils in the same grade and then to compute standard scores
or percentiles. If age norms are desired, the pupils will be distributed ac-
cording to CA or MA, and the medians computed. In larger schools and
school systems norms should be derived for slow and rapid learners as well
as for average or normal learners on each grade level.

43 E. F. Lindquist, “Nationally Coordinated Regional Testing Programs,” in New
Directions for Measurement and Guidance, pages 87-103. Washington: American Council
on Education, 1944.

It would appear, then, that the more specific the norm the more useful
it becomes. Educators are coming to recognize that each individual has
his own unique pattern of growth. This position is clearly stated as fol-
lows:⁴⁴


The time has come when we should cease to be primarily interested in comparing 
one child with another, one class with another, or any class with a norm. We should 
be primarily interested in comparing each child with himself, with his past record, 
and with his potentialities. To center attention elsewhere is to miss the point—to 
miss the service which tests can render. 
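The local-norm procedure described earlier in this section (combine all pupils in the same grade, then compute percentiles or standard scores) can be sketched as follows. The scores are invented for illustration, the function names are hypothetical, and the percentile rank follows the definition used in this chapter: the per cent of pupils who fall below the score made.

```python
from statistics import mean, pstdev


def percentile_rank(score, group_scores):
    """Per cent of pupils in the local group scoring below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


def standard_score(score, group_scores):
    """Distance of `score` from the local mean, in standard deviations."""
    return (score - mean(group_scores)) / pstdev(group_scores)


# Hypothetical raw scores for all fifth-grade pupils in one school.
grade_5_scores = [10, 20, 30, 40, 50]

print(percentile_rank(30, grade_5_scores))
print(standard_score(30, grade_5_scores))
```

For age norms the same grouping idea applies, except that pupils would be distributed by CA or MA and the median of each group taken with `statistics.median`.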


Figure 37 is the test-score profile of a college junior, Richard Roe, based 
upon local norms at “Siwash” College. He was administered an intelligence
test; an English battery consisting of grammar, organization, and reading 
tests, the latter having vocabulary, reading speed, and reading level 
subtests; and a five-test achievement battery: history, literature, science, 
art, and mathematics. The norms are shown both as percentile ranks and 
as "stanines" (standard scores on the basis of 9 points—that is, running 
from 1 through 9). TE means “total English score,” and TA means “total
achievement score.” The intelligence, TE, and TA points are connected
with a solid line, while the battery tests and subtests are joined by dotted 
lines. 

In consultation with his adviser, Richard can see at a glance that he is 
above the Siwash junior class average in general, but that his history, 
literature, and art scores fall considerably below the other eight points. 
Richard and his adviser can use this information to good advantage in 
planning his last two years of college courses. 


E. Methods of Comparing Intelligence and Achievement 


One of the most important questions to raise about any pupil is: How 
well is he getting along in comparison with his capacity? Whenever intel-
ligence tests and achievement tests for the pupil have been expressed in
comparable terms, a rough answer to this question is possible. But the
problem is far more difficult than would appear on the surface. According
to Kelley,⁴⁵ about 90 per cent of whatever is measured by a so-called gen-
eral intelligence test is the same as that measured by an all-around achieve-
ment test battery. In like manner he has computed the “community of
function” between intelligence tests and arithmetic tests as about 88 per 
cent, and that between intelligence and reading tests as about 92 per cent. 
Since a “scant one-tenth” of the tests are utilized in the measurement of 
difference between intelligence and achievement, he points out the serious 
hazards of such comparisons. It would certainly appear that unless the
apparent differences are very great, the existence of true differences cannot
be safely inferred. Conrad⁴⁶ has described the conditions which must be
met before scores are strictly comparable.

44 Douglas E. Scates, “The Improvement of Classroom Testing,” Review of Educa-
tional Research, 8: 532, January, 1939.

45 Truman Lee Kelley, Interpretation of Educational Measurements, page 208. Yonkers:
World Book Company, 1927.

46 Conrad, “. . . Measures,” in Encyclopedia of Educational Research,
edited by Walter S. Monroe, pages 340-344. New York: The Macmillan Company, 1941.

[Figure 37 chart: profile sheet headed “Testing Program, Siwash College” for Richard Roe, with percentile ranks (1 through 99) and stanines along the left margin and columns for the intelligence test, the English battery, and the achievement battery.]

Figure 37. A Profile Based Upon Local Norms.


298 THE TESTING PROGRAM 


The accomplishment quotient. In 1920 Franzen⁴⁷ suggested the ac-
complishment quotient, abbreviated AQ. This is the ratio between EA and
MA, or between EQ and IQ. The simplest formula is AQ = 100(EA ÷ MA).
A quotient of 100 is considered the goal. For example, if a pupil whose EA
is 9-2 has an MA of 10-0, his AQ is 100(9.17 ÷ 10.00) = 91.7, or 92. In like
manner, a second pupil might have the same EA, 9-2, but an MA of only
8-3. His AQ would be 100(9.17 ÷ 8.25) = 111. The interpretation of the first
case, 92, is that the pupil is not living fully up to his capacity, which seems 
reasonable enough, human nature being what it is. But the interpretation 
of the second case, 111, is rather absurd, since it appears to imply that 
this pupil has exceeded what he is capable of doing by 11 per cent! A more 
probable explanation is that the quotient is due to inaccuracies in the tests, 
and that in this case the errors in the achievement score were in the direc-
tion of making it too high, whereas the errors in the intelligence score were
in the direction of making it too low. The resulting quotient has added 
these errors. If the errors, whether due to chance or otherwise, had been 
in the same direction, they would have tended to offset each other. One 
reason the use of IQ and EQ involves less risk is that CA, the divisor, is 
almost wholly free from errors of measurement if obtained from a birth 
certificate. Studies by Cureton,⁴⁸ Haggerty,⁴⁹ Tsao,⁵⁰ and others have brought
the AQ and similar measures into general disrepute.
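The two worked cases above can be checked with a short sketch. It assumes ages are written in the years-months form used in the text (for example, 9-2 for nine years two months); the function names are hypothetical.

```python
def age_to_years(age):
    """Convert an age written as 'years-months' (e.g. '9-2') to years."""
    years, months = age.split("-")
    return int(years) + int(months) / 12


def accomplishment_quotient(ea, ma):
    """AQ = 100(EA / MA), with both ages given as 'years-months'."""
    return 100 * age_to_years(ea) / age_to_years(ma)


# First pupil: EA 9-2, MA 10-0. Second pupil: EA 9-2, MA 8-3.
first_aq = round(accomplishment_quotient("9-2", "10-0"))
second_aq = round(accomplishment_quotient("9-2", "8-3"))
print(first_aq, second_aq)
```

Because both EA and MA carry errors of measurement, small shifts in either age move the quotient noticeably, which is the weakness the paragraph above describes.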

Combining intelligence and achievement scores. Several proposals 
have been made for combining scores on intelligence and achievement tests, 
usually for purposes of pupil classification. One of the simplest of these 
proposals is to average the pupil's rank on the two tests. More refined 
methods involve the use of some common denominator, such as the stand-
ard score. One publication⁵¹ suggests the use of promotion age and promo-
tion quotient as a basis of classification for instructional purposes. Promo-
tion age (PrA) is the average EA and MA. In this average the two ages
may be weighted equally or unequally, whichever seems best for the data 
in hand. Then the promotion quotient (PrQ) is the PrA ÷ CA. On the face
of it, such practice appears to be averaging things as unlike as cattle and 
horses. But, if Kelley’s point regarding the great community of function 
between intelligence and achievement tests is well taken, the practice would 
appear to be justified on theoretical grounds. And if it provides a better
basis for grouping pupils, as appears often to be the case, it has justified
itself in practice.

47 Raymond Franzen, “The Accomplishment Quotient,” Teachers College Record, 21:
432-440, November, 1920.

48 Edward E. Cureton, “The Accomplishment Quotient Technic,” Journal of Experi-
mental Education, 5: 315-326, March, 1937.

49 Lida Harmar Haggerty, “An Evaluation of the Accomplishment Quotient: A Four
Year Study at the Junior High School Level,” Journal of Experimental Education, 10:
78-89, September, 1941.

50 Fei Tsao, “Is the AQ or F Score the Last Word in Determining Individual Effort?”
Journal of Educational Psychology, 34: 513-526, December, 1943.

51 Supervisor's Manual for the Metropolitan Achievement Tests. Yonkers:
World Book Company, 1933.
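A minimal sketch of the promotion age and promotion quotient described above. The text gives PrQ simply as PrA divided by CA; multiplying by 100, by analogy with the AQ, is an assumption made here, as are the function names, the weighting parameter, and the sample ages.

```python
def promotion_age(ea_years, ma_years, ea_weight=0.5):
    """Weighted average of EA and MA; equal weights by default."""
    if not 0.0 <= ea_weight <= 1.0:
        raise ValueError("ea_weight must lie between 0 and 1")
    return ea_weight * ea_years + (1.0 - ea_weight) * ma_years


def promotion_quotient(ea_years, ma_years, ca_years, ea_weight=0.5):
    """PrQ = 100 * PrA / CA (the factor of 100 is assumed, see above)."""
    return 100 * promotion_age(ea_years, ma_years, ea_weight) / ca_years


# Hypothetical pupil: EA 10.0 years, MA 11.0 years, CA 10.5 years.
pra = promotion_age(10.0, 11.0)
prq = promotion_quotient(10.0, 11.0, 10.5)
print(pra, prq)
```

Changing `ea_weight` corresponds to weighting the two ages unequally, whichever seems best for the data in hand.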


F. The Use of Norms in Interpreting Scores on 
Personality Tests 


As a rule, personality test manuals do not contain elaborate systems of 
norms. The problem is inherently more difficult than that presented by 
either intelligence or achievement tests, for which we have seen that the
norms are far from ideal. Indeed, the very essence of personality is its
uniqueness. It is here that the good judgment and common sense of the
teacher are highly important. 

Terman⁵² strongly questions the possibility, or desirability, of establish-
ing norms for evaluating or adjusting personalities. He says:


The psychologist stands aghast at the self-assurance with which the professional 
school counselors in America diagnose the personality faults of little children and 
at the boldness with which they undertake the delicate task of adjustment. . . . 
The student of genius who is familiar with the motivating influences that have 
their origin in quirks of childhood personality shudders to think what the result 
would have been if school counselors had had a chance to “adjust” the personalities 
of the budding geniuses of history. One can imagine them, freed from all their 
peculiarities and complexes, adjusted to the world as it was and becoming indis-
tinguishable from the common herd.


On the same point Poffenberger⁵³ quotes with approval this statement
from Burbank,⁵⁴ growing out of a lifelong study of plant life:


One of the greatest fallacies of near science and of amateurs in Nature’s school 
is the belief that only from the normal can we get our best development and re-
sults. As a matter of fact, Nature shows us again and again that it is from abnor-
malities that some of our most valuable and beautiful plants ange: V. From that
weak, or abnormal plant—that genius plant—may come the very characteristics 
that we are looking for, and our only problem is to nurse it physically and keep it
strong to pass on its overload of spiritual or esthetic essences to its children. 


Probably the professional educator could hardly do better than to accept
wholeheartedly the motto of the founder of the eugenics movement in 
England, which was “Treasure your exceptions.” Those who deviate most 
widely from the average deserve special consideration. It is from this group 
that geniuses are recruited as well as social misfits of all types. It is socially 
undesirable, as well as psychologically impossible, to make everybody alike. 


A distinguished psychiatrist gives this wholesome comment:⁵⁵


The adjuration to be "normal" seems shockingly repellent to me; I see neither
hope nor comfort in sinking to that low level. I think it is ignorance that makes
people think of abnormality only with horror and allows them to remain undis-
mayed at the proximity of "normal" to average and mediocre. For surely anyone
who achieves anything is, a priori, abnormal; this includes, not only the geniuses,
but the presidents, the leaders, and the great entertainers. I presume most of the
people in Who's Who in America would resent being called normal.


52 Lewis M. Terman, “The Measurement of Personality,” Science, 80: 605-608, De-
cember 28, 1934.

53 Albert T. Poffenberger, “Psychology and Life,” Psychological Review, 43: 30, Janu-
ary, 1936.

54 Luther Burbank and Wilbur Hall, The Harvest of the Years, page 273. Boston:
Houghton Mifflin Company, 1927.

55 Karl A. Menninger, The Human Mind (Third Edition), page xiv. New York:
Alfred A. Knopf, Inc., 1945.


As a summarizing statement concerning norms, Flanagan’s opening sen-
tence is highly pertinent:⁵⁶


Test scores are meaningful and valuable to the extent that they can be interpreted
in terms of capacities, abilities, and accomplishments of educational significance. 


SELECTED REFERENCES FOR FURTHER READING


Chauncey, Henry, and Frederiksen, Norman, “The Functions of Measurement in
Educational Placement,” Chapter 4 in E. F. Lindquist (Editor), Educational
Measurement. Washington, D. C.: American Council on Education, 1951.

Cook, Walter W., “The Functions of Measurement in the Facilitation of Learning,”
Chapter 1 in E. F. Lindquist (Editor), Educational Measurement. Washington,
D. C.: American Council on Education, 1951.

Flanagan, John C., The Cooperative Achievement Tests, A Bulletin Reporting the
Basic Principles and Procedures Used in the Development of Their System of Scaled
Scores. New York: Cooperative Test Service, 1939. 41 pages.

Flanagan, John C. (Chairman), “Establishing the Type of Norms Most Useful and
Important for the Interpretation of Achievement Test Scores,” pages 65-113 in
Proceedings of the 1948 Invitational Conference on Testing Problems. Princeton,
New Jersey: Educational Testing Service, 1949. Papers by Lee J. Cronbach,
Walter N. Durost, Eric F. Gardner, E. F. Lindquist, Robert L. Thorndike, and
Arthur E. Traxler.

Flanagan, John C., “Units, Scores, and Norms,” Chapter 17 in E. F. Lindquist
(Editor), Educational Measurement. Washington, D. C.: American Council on
Education, 1951.

Rulon, Phillip J., “Problems of Regression,” Harvard Educational Review, 11:
213-223, March, 1941.

Rulon, Phillip J., “On the Concepts of Growth and Ability,” Harvard Educational
Review, 17: 1-9, Winter, 1947.

Seashore, Harold G., and Ricks, James H., Jr., “Norms Must Be Relevant,” Test
Service Bulletin No. 39, pages 1-4. Psychological Corporation, 522 Fifth Avenue,
New York 18, New York, May, 1950.

Stanley, Julian C., “Why Wechsler-Bellevue Full-Scale IQ's Are More Variable
Than Averages of Verbal and Performance IQ's,” Journal of Consulting Psy-
chology, 17: 419-420, December, 1953.

Tiedeman, David V., “Has He Grown?” Test Service Notebook No. 12. World Book
Company, Yonkers-on-Hudson, New York, 1952. 4 pages.


56 John C. Flanagan, “Units, Scores, and Norms,” page 695 in E. F. Lindquist
(Editor), Educational Measurement. Washington, D. C.: American Council on Educa-
tion, 1951.


PART IV 


Measurement 
in 
Instruction 


11


Motivation and Practice as 
Related to Testing 




A. The Problem of Motivation


Importance of motivation. It is generally recognized in ordinary expe- 
rience that motivation occupies an important place in human affairs. Such 
familiar proverbs as “You can lead a horse to water but you can’t make 
him drink” and “It is hard to teach an old dog new tricks” assign to mo- 
tivation a key position. The horse does not drink for the simple reason that he 
does not want to drink, and the old dog's poor performance is due not so much
to lack of ability as to the fact that he has become too well satisfied with 
the tricks he already knows. In a like manner, every experienced teacher 
has seen pupils of mediocre capacity succeed because of interest and en- 
thusiasm, while others of more promise have failed utterly because of lack 
of it. With these observations growing out of ordinary experience the views 
of psychologists and other keen students of education are in accord. 

Meaning of motivation. The term motivation is very inclusive. Liter-
ally it means causing movement. A convenient grouping of motives into
two major classes is suggested—internal or organic, and external or envi- 
ronmental. In recent years the term drive, or urge, has been used for the 
former, and goal, or incentive, for the latter. But in the final analysis, mo- 
tivation, though in some instances externally initiated, always functions 
internally. Hunger, thirst, and sex, as well as interests, attitudes, wants, 
desires, and temporary mental sets, are examples of drives. Incentives may 
be negative, as are pain or punishment, or positive, as are rewards in a 
multitude of forms. A further distinction is often made between motives
which are natural or intrinsic, such as a child’s interest in play or the
movies, and those which are artificial or extrinsic, such as prizes, marks,
grades, credits, and honor rolls. 

Relation of measurement to motivation. Measurement is related 
in at least two ways to motivation. In the first place, there is the problem 
of the measurement of motivation itself. It is often important to know the 
differences among individuals in the strength of various motives, the com- 
parative strength of the same motive under varying conditions, and the 
strength of a given motive in comparison with other motives in the same 
individual. As the development of wholesome attitudes and interests is an 
objective of modern education, it is just as necessary to know how to meas- 
ure it as to know how to measure any other objective. While much valuable 
work has been done in the measurement of animal drives, up to the present 
no convenient technique has been devised for measuring human motives 
in any precise manner. The measurement of motivation in education is, 
then, a problem for the future. 

In the second place, there is the problem of the relation of the measure- 
ment of educational capacity and achievement to the motivation of learn- 
ing and teaching. Since teaching and learning are two aspects of the same 
process, it is reasonable to expect that measurement will be intimately 
related to both. Some of the more important relationships will be consid- 
ered in the next two sections. 


B. The Relation of Measurement to Motivation in Teaching 


Purpose of the teacher and measurement. An obvious relationship 
of measurement and motivation in teaching arises from the fact that the 
purpose of the teacher determines the type of measurement used. Whether, 
for example, the teacher gives many tests or few, long tests or short, in- 
formal tests or standardized, survey tests or diagnostic, depends upon his 
purpose. Since not all tests serve the same purpose equally well, as has been 
pointed out, the choice of the measuring instrument becomes a matter of 
primary importance. This problem has already been considered at some 
length in Chapter 4. Certain points to be raised later in this chapter will 
also have a bearing upon it. 

Teaching emphasis and measurement. The proper teaching em- 
phasis is determined by the results of measurement. Measurement directly 
demonstrates the quality of the pupil's learning, but it also indirectly 
reflects the quality of the teacher's teaching. In the light of measured 
results conscientious teachers attempt as far as possible to correct weak- 
nesses in past teaching and to prevent their recurrence in future teaching. 
Messenger¹ studied the “influence of the Iowa Academic Testing Program
in relation to the teaching of English mechanics in an Iowa high school" 
and found that the effect had been to “motivate teachers to greater effort,” 
with the allotment of more time to the subject and the use of more drill
material. One of the chief values of measurement may well be its motivating
effect upon the teacher.

1 Unpublished master's thesis, University of Iowa, 1934.

Taba early realized this relationship and spoke regretfully of the “formi- 
dable and serious handicap” of the progressive schools due to the “lack of 
forms of testing that are in harmony with their aims and adequate to their 
purposes." Then she added these significant words:²


After all, one teaches only what one, in some way or another, is able to evaluate 
as an outcome of that teaching. If we are unable to evaluate the growth of in- 
tegrations and meanings and ways of behavior, we are unable even to form an 
adequate notion of them, still less to guide the process of learning in these terms.


Attention should be called to the fact that Taba’s recognition of the limi- 
tations of existing measurement does not blind her to the important moti- 
vating influence of measurement, whether good or bad, and to the urgent 
necessity for continuous improvement of measuring instruments. 

There is, of course, some danger that the content of the examination 
may exert too great an influence over the teaching emphasis and curricu- 
lum content. It has often been alleged that something like this has hap- 
pened in the case of the New York Regents and the College Entrance 
Board examinations, where an important effect of such examinations has 
been to turn secondary schools into “cramming” schools pointing toward 
the probable content of such examinations as revealed by past examina- 
tions by the same agencies. Insofar as this is true, it represents unfortunate 
and unwarranted control over teaching procedures, which not only defeats 
the purpose of the examination but places an obstacle in the way of educa- 
tion itself. Such practice fails to take into account the sampling nature of 
all tests. Any attempt to drill pupils in advance specifically upon the items 
of the test tends to narrow teaching to the scope of the testing, so that the 
two tend to become synonymous rather than one a random sampling of
the other. This should be clearly understood, and every reasonable pre- 
caution should be taken to insure that state-wide testing programs and 
other forms of academic competition do not promote mere monotonous 
drill exercises of an especially narrow and vicious type. 

A few studies have been made of the effect of exempting pupils from 
final examinations. An early study by Anderson³ indicated that exempting
high-school pupils who reached a minimum standard from final examinations
had “played havoc with teachers’ grades,” while the actual performance
of the pupils had shown “no appreciable increase." Apparently the
motivation had been in the wrong place. That such a disastrous result need 
not occur is indicated by a subsequent study of the same problem reported 


2 Hilda Taba, The Dynamics of Education, page 185. New York: Harcourt, Brace and
Company, Inc., 1932.

3 C. J. Anderson, “Is the Exemption System Worth While?" School and Society, 3:
357-360, March 4, 1916.



by White,⁴ who found that the general distribution of marks in his school
had changed very little under the exemption system. Even here, however, 
was found a “decided dip in the distributions immediately below the ex- 
emption point of 85 per cent and a corresponding rise just above it." 
Schools employing an exemption system should remain constantly on guard 
lest its effect be more to stimulate the teachers to give high marks than the 
pupils to earn them. 


C. The Relation of Measurement to Motivation in Learning 


Close as is the relationship of measurement to the motivation of teach- 
ing, the relationship is even closer to the motivation of learning. Elsewhere 
Ross stated the problem as follows:⁵

Behind the act of learning is the capacity to learn, and back of the capacity is 
the motive to learn—the desire, urge, impulse, drive, or something, that makes the 
creature want to learn, that pushes him out to meet his environment. One of the 
reasons why most of the correlations of mental capacity with actual achievement, 
in school and out, have been disappointingly low, has been that students of real 
ability have not felt a proper urge to work, while those of mediocre talent have 
frequently possessed the urge to achieve. 

If we are ever to be successful in our efforts to predict achievement, therefore, 
we must not be content with merely analyzing the learning process, understanding 
the mechanism of learning, its structure and laws of operation, nor with merely 
exploring the height and range of human possibilities; but we must also find out 
about the dynamic aspects of human nature. We must discover not only how the 
mind works, but why it works when it does and the way it does. 


The foregoing statement is an introduction to a report of an experimental 
attack on one phase of motivation. The study will be briefly presented here 
as an illustration of one type of psychological experiment concerned with 
this important problem. Afterwards some critical comments upon this and 
other similar experiments will be given. 

An experiment in motivation. The problem was to determine the 
influence of a knowledge of results upon the achievement of 59 college 
students in a simple act of motor skill, making tally marks. The procedure
was as follows: Upon the basis of an initial practice period three
equivalent groups were formed. One of these, the control, had no knowledge
of its progress throughout the first ten practice periods of one minute each.
During this time one experimental group had full knowledge of results, and 
the other had partial knowledge. At the beginning of each practice period 
after the first, each pupil in the group with full knowledge was shown his 
paper of the preceding day, with scores and corrections indicated. A distri- 
bution of scores for this group was placed on the board, and each student 

4 Clyde W. White, “The Effects of Exemptions from Semester Examinations on the 
Distribution of School Marks,” School Review, 39: 293-299, April, 1931. 


5 Clay Campbell Ross, “An Experiment in Motivation,” Journal of Educational Psychology,
18: 337-346, May, 1927. Page 337.



was urged to watch his daily progress, both relative and absolute. In the 
experimental group with partial knowledge of results, each student was 
told whether he was above or below the average of the group, but that was 
all. At the end of ten practice periods conditions were reversed; the group 
which had had no knowledge of results was then given full knowledge for 
two additional periods, and the other two groups were given no knowledge 
for these two periods. 

The results are shown in Figure 30 on page 266. On the whole, they 
seemed to justify the conclusion that “the addition of a single other motivating
factor, knowledge of results, is sufficient to give the pupils with such
knowledge a distinct superiority over the others, and the degree of superiority
is roughly proportional to the amount of information possessed."

Limitations of experiments on motivation. The experiment just
summarized illustrates several of the weaknesses of the experimental work 
so far reported on motivation and, for that matter, on other phases of 
learning as well. These may be conveniently grouped under three headings: 
factors studied, subjects used, and conclusions drawn. 

In the first place, the factors so far studied leave much to be desired. 
Most of the studies involved are concerned with highly artificial and often 
trivial tasks. Take the above experiment as an example. It is highly im- 
probable that students will show much enthusiasm over making, as rapidly 
as possible, groups of four vertical lines crossed horizontally with a fifth. 
Tallying, to be significant to most persons, would have to be employed as 
a record of some athletic contest or other situation in which it is a means
to an end and not an end in itself. A survey of the literature reveals that 
a large percentage of motivation studies have been concerned with making 
legible a’s, canceling numbers or letters, assigning a number to a dictated 
word, learning trivial facts or actual misinformation, running mazes, and 
the like. What we need to know is how people behave under actual school 
conditions. Even when arithmetic and other school materials are employed,
the experiments are rarely carried on long enough for the novelty of the 
task to wear off and for the experimental factor to operate under reasonably 
normal conditions. The total learning time is frequently less than one hour, 
and often it is only ten to fifteen minutes, as in the above laboratory experi- 
ment. To be most helpful in guiding teachers in the day-by-day conduct of 
their classes these experimental factors must be continued for at least sev- 
eral weeks. 

In the second place, the choice of subjects for the experiment has usually 
been rather unfortunate. Many of the laboratory studies of the effects of 
rewards and punishments have been limited entirely to animals. No matter 
how thorough a believer one might be in evolution, one must certainly see 
that the behavior of rats in a maze or cats in a puzzle box might very well 
be different in essential respects from that of school children facing the
intricacies of a foreign language or learning to manipulate the abstract
symbols of algebra. Even when human subjects have been used, as they 
have been in many experiments, they have usually been adults, often stu- 
dents of psychology. Frequently, also, the number of subjects has been 
small, with a poorly equated control group, or with none at all. To be most 
convincing as a guide for school practice these experiments must be per- 
formed with children of approximately the same age and type as those 
found in the schools where the results are applied. If our studies of either 
maturation or learning are to be relied upon, the child is not merely a 
miniature adult. And if the numerous studies of individual differences from 
the time of Galton to the present have established anything, it is that what 
is true of one person is not necessarily true of another. Experiments based 
on a handful of subjects, therefore, must be accepted with considerable
discount. 

In the third place, the conclusions drawn have often gone far beyond 
the experimental facts available. This is mainly the result of the two limi- 
tations already mentioned. If experimenters would be content to draw con- 
clusions from, and to make applications to, the same or closely similar 
subjects and to the same or closely similar tasks, no harm would be done. 
But this is rarely the case. To generalize from one age level to another is 
risky even when the task is the same, but it is particularly hazardous when 
the activity itself is different. Yet this very thing is commonly done. 
A meaningless and often trivial act is performed by adults under the highly 
artificial conditions of the laboratory. Then the results are applied without 
qualification to the meaningful learning of children under the actual con- 
ditions of the schoolroom. That this procedure is wholly unwarranted will 
appear from an experiment by the writer to be reported later in the chapter. 
But to generalize from the behavior of a rat in a psychological laboratory 
to that of a child in an ordinary classroom is little less than foolhardy. 

Types of motivation experiments. From what has just been said, 
it may appear that the writers believe all motivation experiments to be 
worthless. Such is far from their conviction, however. While they believe 
that practically all such experiments so far reported have certain weak- 
nesses to which attention should be called, they are convinced that genuine 
progress has been made and that the way has been opened for further 
studies to supplement those already in existence. It has been definitely 
shown that the problem, although difficult, is susceptible to experimental 
attack. Furthermore, experimental evidence already available is sufficient 
to provide at least tentative answers to two important questions: 


1. What is the relation of measurement to the amount and quality of learning? 


2. What is the relation of measurement to the type of learning, or to the learning 
procedure followed by the student? 


These two questions will now be considered. 




I. The Relation of Measurement to the Amount and 
Quality of Learning 


There is considerable experimental evidence regarding the influence upon 
the amount and quality of learning of three measurement factors, or groups 
of factors: 


a. The frequency of the tests. 
b. The knowledge that a final examination would be given. 
c. The knowledge of results or progress in learning. 


Attention has been given to the operation of these factors individually, in 
combination with each other, and in combination with other motivating 
factors, such as praise and blame, rivalry, and various types of material 
rewards. Some of these findings will now be summarized. 

Frequency of tests. Practice varies widely regarding the frequency of 
testing. At one extreme are the teachers who give no written examination 
of any kind, and at the other extreme are those who give a test of some kind 
every day. What experimental evidence is there to indicate the proper fre- 
quency? Manifestly, whatever advantage may exist in frequent testing 
cannot be attributed solely to the motivating effects, however, since the 
additional practice afforded by taking the extra tests must also be con- 
sidered. 

The experimental evidence regarding proper frequency of testing in social 
studies classes in high school has not been very convincing. Hoglan⁶ studied
the frequency of testing in American history in one Iowa high school. He
found no significant differences among three groups, equated on the bases
of intelligence and knowledge of history. One group had daily tests, another
had three unannounced tests per week, and a third had only the regular
tests at intervals of six weeks. In a similar study in the same subject Camp⁷
found a slight (but not a statistically significant) difference between one
group tested once or twice a week and another tested once in two or three
weeks. In an earlier study in community civics Shore⁸ found no advantage
in giving each day a true-false test of ten items; but a group given an un- 
announced test two or three times a week did show a statistically significant 
superiority over a group given only the mid-semester and the final tests. 
On the whole, the evidence appears to favor slightly the practice of giving 
a test once or twice a week to classes in the social studies in high school. 

Two experiments on the effects of frequent testing of high school biology 
students reported conflicting results. Kitch⁹ found that the group taught
with the aid of self-scored unit tests did significantly better than the group 


6 Unpublished master's thesis, University of Iowa, 1932.

7 Unpublished master's thesis, University of Iowa, 1931.

8 Unpublished master's thesis, University of Iowa, 1925.

9 Loran V. Kitch, unpublished master's thesis, University of Southern California,
1932.



without such tests. Gable¹⁰ compared the merits of three procedures. One
group was told that it would be tested each day, another group that it 
would be given announced unit tests, and a third group understood that 
it would be tested without notice at irregular intervals. On the whole, the 
poorest record was by the group taking daily tests, but there was a tend- 
ency for the slower pupils to do better when a test was announced which 
gave time for review. 

Conner¹¹ found that the use of a well-known series of instructional tests
in high-school physics had not resulted in sufficient improvement in learning
to justify the time expended. Kugle¹² reported that short daily tests
in physics resulted in pupils’ having a small superiority over those to whom
tests were given only at the ends of units. Kirkpatrick¹³ found a distinct
advantage, in the 26 high-school physics classes included in his study, in 
giving an objective test at the beginning of each unit. As each unit covered 
from one to three days, this meant that tests were given at least twice a 
week. The pupils had definite knowledge that the test would be given, that 
it would cover all the important concepts of the unit, and that the final 
examination would include only points included in these unit tests. The 
tests were corrected in class and were used as a basis for class discussion and 
subsequent study. Both experimental and control groups took the same 
term tests at intervals of six weeks. When the experimental groups were 
considered as a whole, a highly significant statistical difference was found 
on a test of objective information given at the end of the course, but this 
superiority had largely disappeared four months later. The testing program 
was most beneficial to the pupils in the lowest third in mental ability. This 
suggests that schools attempting to group pupils according to ability may 
very well consider varying the testing program as well as the curriculum 
and teaching methods. 

Experiments involving the frequency of testing have been most numer- 
ous and, on the whole, most convincing on the college level. A serious lim- 
itation, however, is that they have been largely restricted to classes in 
general and educational psychology. Jones,¹⁴ in a pioneer study, gave five-minute
completion tests, euphoniously called “terminal reviews,” at the
end of each of 27 lectures in psychology. Eight weeks later the groups so
tested made scores on a final examination that were approximately twice 
as high as those of groups who had had no "terminal reviews." Another 


10 Sister Felicita Gable, The Effect of Two Contrasting Forms of Testing Upon Learning.
Baltimore: Johns Hopkins Press, 1936.

11 Unpublished master's thesis, University of Iowa, 1932.

12 Unpublished master’s thesis, Pennsylvania State College, 1936.

13 James Earl Kirkpatrick, “The Motivating Effect of a Specific Type of Testing
Program,” University of Iowa Studies in Education, 9: 41-68, June 15, 1934.

14 Harold E. Jones, “Experimental Studies of College Teaching,” Archives of Psychology,
68: 36-70, November, 1923.



study reports advantages to be gained in using weekly objective tests in 
general psychology.¹⁵

Both Turney¹⁶ and Keys¹⁷ found weekly tests in educational psychology
better than tests given less frequently. Turney found that, when given 
weekly tests, a class which was well below another class at the beginning 
was able to equal the achievement of the other class, which had only one 
short test, in addition to the mid-semester and the final, which both groups 
had. Keys found that eight weekly tests gave an advantage over the same 
items given to equivalent groups in the form of two monthly tests. How- 
ever, on an unannounced examination covering the same material given 
five weeks later, this advantage had been reduced. When the regular final 
examination came, after the additional two weeks, the achievement of the 
experimental and control groups was practically identical. What the effect 
of the weekly testing was after still larger intervals is, unfortunately, un- 
known. 

Johnson¹⁸ compared the effect of written unit tests and the effect of an
equal amount of time devoted to oral reviews with 55 pairs of freshman 
girls in two classes in child development. She found that a statistically 
significant difference in favor of the tested group had disappeared twelve 
weeks later. She concluded that “there is as yet no evidence to show that 
the greater achievement which has been induced by examinations persists 
after six weeks to three months." 

A few studies have reported little or no advantage in weekly tests, even 
when comparisons were made at the end of the course. For example, 
weekly tests in general psychology at the University of Minnesota gave
negative results.¹⁹ Both Noll²⁰ and Ross and Henry²¹ found a slight superiority
in less frequently tested groups in educational psychology. However,
Ross and Henry in both general and educational psychology, and Noll in
educational psychology, found evidence that the benefit of weekly tests was
greatest for the students of low ability. It is evident that there is no one 

15 C. C. Ross and Lyle K. Henry, “The Relation between Frequency of Testing and
Progress in Learning Psychology,” Journal of Educational Psychology, 30: 604-611, 
November, 1939. 

16 Austin H. Turney, “The Effect of Frequent Short Objective Tests upon the Achievement
of College Students in Educational Psychology,” School and Society, 33: 760-762,
1931.

17 Noel Keys, “The Influence on Learning and Retention of Weekly as Opposed to
Monthly Tests,” Journal of Educational Psychology, 25: 427-436, September, 1934.

18 Bess E. Johnson, “The Effect of Written Examinations on Learning and on Retention
of Learning,” Journal of Experimental Education, 7: 55-62, September, 1938.

19 A. C. Eurich, H. P. Longstaff, and M. Wilder, The Effective College Curriculum as
Revealed by Examinations, pages 333-347. Minneapolis: University of Minnesota Press,

20 Victor H. Noll, “The Effect of Written Tests upon Achievement in College Classes:
An Experiment and a Summary of Evidence," Journal of Educational Research, 32:
345-358, January, 1939.

21 C. C. Ross and Lyle K. Henry, op. cit., pages 609-610.



best testing technique which is equally effective under all conditions. Test- 
ing methods as well as other teaching procedures must consider the ability 
of the student as well as the nature of the subject. 

Kulp²² gave the students in a graduate class in educational sociology who
were below the median on the mid-semester examination a weekly ten- 
minute objective test for the next seven weeks. The students above the 
median were excused from these short tests. On the final examination,
"identical in all respects with the seven weekly tests,” the superiority of 
the upper half was reduced considerably, probably due largely to practice 
and regression effects rather than to increased motivation. Pressey²³ reports
an interesting variation of this procedure as used by Smeltzer in educational
psychology. Both experimental and control classes were given
weekly tests. But in the experimental class, to whom the test was given on
Thursday of each week, the papers were returned and discussed on Friday.
Those who had made unsatisfactory scores were tested again over the same
material after a brief review on Monday, while the others were excused. 
On the final examination the experimental group was above the control 
group, the advantage being largely with the pupils who were in the lowest 
fourth of the class and who had taken the retests. 

Three of the above experiments attempted to get the students’ attitude 
toward the frequent testing. By means of unsigned questionnaires in three 
classes, Jones found that 70 per cent of the students approved the “termi- 
nal review method.” In like manner, Turney discovered an “excellent at- 
titude" toward frequent testing in his experimental group; about 85 per 
cent thought they had studied more, and over 90 per cent said that they 
preferred to be in that section and that they felt they had learned more 
even if they had made no better grade. From “an extensive questionnaire 
touching some thirty issues of educational theory and practice," given at
the opening and repeated near the end of the semester, Keys found: “Without
comment by the instructor or knowledge of the experiment in progress,
students disclose a strong and growing conviction of the desirability of tests 
given as frequently as every second, third, or fourth class session." The 
evidence strongly suggests that students favor frequent testing. 

Awareness of final examination. To what extent is the “intention to 
remember" or "temporal set" a factor in learning? Will the expectation 
that the material will have to be recalled later influence the amount re- 
tained? More specifically, how will the awareness of a final examination 
affect the progress of learning and of forgetting? One or the other of the 
following “two rival and mutually exclusive hypotheses," as suggested by 
Remmers,²⁴ is apparently true:

22 Daniel H. Kulp, II, “Weekly Tests for Graduate Students?” School and Society,
38: 157-159, July 29, 1933. 


23 Sidney L. Pressey, Psychology and the New Education, pages 363-366. New York: 
Harper & Brothers, 1933. 


24 H. H. Remmers and others, “Exemption from College Semester Examinations,”
Purdue University Studies in Higher Education, 23: 11, November, 1933. 



1. Exemption from final examinations with its requirement of continuous high-level
learning provides better motivation and therefore more permanent learning
and integration than does the final examination. 

2. The final examination provides the opportunity and at least a part of the 
stimulation for the better development of certain abilities such as rapid organiza- 
tion of a large mass of material; the ability to select crucial data from the large 
mass of material; to see pertinent relationships, to reason in terms of the subject 
matter, to apply this reasoning to significant problems, etc.; and in general more 
effective and permanent learning. 

Thisted and Remmers²⁵ summarized the literature on the general problem,
including studies on such dissimilar materials as stories, objects shown,
vocabulary, nonsense syllables, photographs, and stylus mazes. They con- 
cluded: “It is evident that a condition of expectation of recall, when 
injected into the initial instructions, has given variable and conflicting re- 
sults.” Their own study, which included 404 psychology students, involved 
learning Anglo-Saxon vocabulary and the factual content of two articles 
presented in mimeographed form under ordinary classroom conditions. The 
control group understood that they were to be tested immediately, and 
the experimental groups understood that they were to be tested later also, 
after three days in some cases, after one week and after two weeks in other 
cases. The experiments tended to establish a somewhat slower drop in the 
forgetting curve when a set to prepare for delayed recall was introduced. 

While the learning material in the above experiments was not left in the 
hands of the students, it is reasonably sure that one effect of the “temporal 
set" was to cause those who expected to have to recall the material later 
to give a “mental review” of what they could remember, as well as probably 
to exchange ideas with other students. Under ordinary school conditions 
one might expect that the effect of reviewing for an expected examination 
might be larger. However, Remmers²⁶ later found that exempting students
in mathematics and applied mechanics made “relatively little difference 
in the amount, quality, or permanence of learning, at least as measured by 
current types of tests and examinations.” 

Pease²⁷ reported some interesting studies of the effect of “cramming”
on the amount of class materials retained. The first study included several 
classes—in all, 302 college students and 106 high-school pupils—separated 
into equivalent groups on the basis of intelligence. A test of 100 objective 
items was prepared for each class, “covering several months of the usual 
course work already completed by the students.” At a meeting of the class 
the purpose of the experiment was clearly explained. The experimental 
group in each class was then dismissed, with instructions to spend at least 
an hour in review for the examination which was to come at the next 
meeting of the class. The control group in each class took the examination 


25 M. N. Thisted and H. H. Remmers, “The Effect of Temporal Set on Learning,”
Journal of Applied Psychology, 16: 257-268, June, 1932.

26 H. H. Remmers and others, op. cit., page 52.

27 Glenn B. Pease, “Should Teachers Give Warning of Tests and Examinations?”
Journal of Educational Psychology, 21: 273-277, April, 1930.



at once without warning or review. The mean score of the experimental 
group exceeded that of the control group in each class, the average supe- 
riority being 11.1 points on the 100-item test. Without warning, the test 
was repeated six weeks later. The average lead of the experimental group 
had been reduced to 6.3 points. But there was still a significant difference 
for all classes containing as many as fifteen pairs of pupils. However, when 
one of these classes was retested after an additional six weeks, the lead of 
the experimental group was reduced by about half and was not then sig- 
nificant. After twelve weeks the lead in another class had been reduced 
from 17.07 points to 2.7 points, which was not a significant difference. 
These results had been produced by an amount of “cramming” by the 
experimental groups that represented about one and one half hours, on the 
average. It appeared probable that the time so spent yielded returns that 
averaged higher than the same amount of time spent either in class attend- 
ance or in regular preparation outside. Pease concludes that “from the 
standpoint of the student, it pays to cram." 
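
The shrinkage of the crammers' lead can be expressed as the share of the initial advantage that was still present at each retest. The following is a minimal sketch in Python that uses only the figures quoted from Pease above; the function name `fraction_retained` and the variable names are illustrative assumptions and do not come from the original study.

```python
def fraction_retained(initial_lead: float, later_lead: float) -> float:
    """Share of the initial score advantage still present at a retest."""
    if initial_lead <= 0:
        raise ValueError("initial_lead must be positive")
    return later_lead / initial_lead

# Leads quoted in the text, in points on the 100-item test.
# Average over all classes: 11.1 points at first, 6.3 points six weeks later.
all_classes_six_weeks = fraction_retained(11.1, 6.3)
# One class: 17.07 points at first, 2.7 points twelve weeks later.
one_class_twelve_weeks = fraction_retained(17.07, 2.7)

print(f"All classes, six weeks later: {all_classes_six_weeks:.0%} of the lead remained")
print(f"One class, twelve weeks later: {one_class_twelve_weeks:.0%} of the lead remained")
```

The two ratios come from different sets of pupils (all classes in one case, a single class in the other), so they illustrate the general decline and should not be read as two points on one forgetting curve.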

Tyler and Chalmers²⁸ studied the effect on test results of warning junior-high-school
pupils that they would have a unit test in general science on
the following day. The test scores of pupils so warned were compared with
those of comparable pupils who had no specific warning, although they were all
aware that it was customary to have a test at the end of each unit, usually 
with the time announced at least two days in advance. All of the obtained 
differences favored the warned groups but by margins below the level of 
statistical significance. Six weeks later, when the tests were repeated, the 
differences had practically disappeared. The authors questioned whether 
junior-high-school pupils are really motivated to study for unit tests even 
when announced, or know how to study effectively when they try. To be 
effective, motivation has to be intelligently directed. 

White²⁹ conducted an experiment that bears directly upon the effect of
exemption from a final examination. Three classes in general psychology 
which met once a week for seventeen weeks were divided, according to 
chance, into experimental and control groups. At each weekly class meeting 
both groups were given a “comprehensive, mimeographed true-false test 
covering the chapters studied for the period." From the outset the control 
groups understood that their marks in the course would be based solely 
upon these weekly tests, while the experimental groups understood that 
they were to have a final examination that would count 50 per cent toward 
their course marks. At the class meeting following each test the corrected 
papers were returned to all students, and they were allowed to keep the 
papers. The experimental groups were urged to preserve these test papers

28 F. T. Tyler and T. M. Chalmers, “The Effect on Scores of Warning Junior High
School Pupils of Coming Tests," Journal of Educational Research, 37: 200-296, December,
1943.

29 Hubert B. White, “Testing as an Aid to Learning," Educational Administration
and Supervision, 18: 41-40, January, 1932.



for further study, as the final examination would contain exactly the same 
items. At the end of the seventeen weeks the final examination was given 
to both groups, the hearty co-operation of the students in the control 
groups being asked in order to determine the value of the experiment. The 
difference between the groups was 51.2 per cent, the experimental group 
having gained 31.6 per cent and the control group having lost 19.6 per cent. 
Even more convincing was the equal superiority of the experimental group 
on a completion test “with which they were wholly unfamiliar.” 
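
The 51.2 per cent figure reported for White's experiment is the gap between a gain and a loss, not a gain alone. The short Python sketch below checks that arithmetic. It treats the two reported figures as signed changes, which is an assumption: the text does not state the baseline from which the gain and the loss were measured, and the variable names are illustrative.

```python
# Signed changes quoted in the text, in per cent (baseline not stated there).
experimental_change = 31.6   # experimental groups gained
control_change = -19.6       # control groups lost

# The gap between the groups is the difference of the two signed changes.
# Rounding to one decimal place avoids floating-point noise in the sum.
difference = round(experimental_change - control_change, 1)
print(difference)  # 51.2
```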

Knowledge of test scores. What is the effect upon the course of learn- 
ing of the knowledge of progress, afforded by test scores or by other means? 
The answer to this question has been sought many times in the psycholog- 
ical laboratory, with practically unanimous results. Psychologists are in 
substantial agreement with the conclusion of an early study³⁰ that “the
addition of a single other motivating factor, namely, knowledge of results, 
is sufficient to give the pupils with such knowledge a distinct superiority 
over the others, and the degree of superiority is roughly proportional to the 
amount of information possessed.” However, as has been pointed out earlier 
in the chapter, experiments conducted in the classroom are far more con- 
vincing. We shall now take a look at what they have shown. 

One of the earliest and most comprehensive of these studies conducted 
under actual schoolroom conditions was that of Panlasigui.³¹ The findings
were based on 358 pairs of pupils in fourth-grade arithmetic in ten cities. 
The practice material consisted of fifteen minutes’ drill in examples of the 
mixed type of fundamentals once a week for twenty weeks. As all pupils 
scored their papers after each drill, it can be seen that each pupil knew his 
achievement for the day, although this knowledge must be related to previous 
records in order to be a knowledge of progress, strictly speaking. In the ex- 
perimental classes the idea of progress was stressed, progress charts for 
both the individual and the class being kept in a conspicuous place. The
teachers of the control classes, on the other hand, were instructed as follows: 
"Please keep very much out of class discussion any reference about how 
much pupils are scoring." The comparison appeared to be, then, between 
experimental classes with somewhat more stress on progress, and control 
classes with somewhat less stress on progress, than is customary to the 
ordinary teacher. On a comprehensive test the mean of the experimental 
classes exceeded that of the control classes by 11.34. A detailed examination 
of the results reveals the fact that this superiority is most in evidence in 
the highest quarter and practically non-existent in the lowest quarter. “The 
beneficial effect of awareness of success, then, was substantially in direct 
proportion to the amount of success available for motivation.” This is also 

30 See pages 306-307 for a brief description.

31 Isidoro Panlasigui, The Effect of Awareness of Success on Skill in Arithmetic, unpublished
doctor's dissertation. Iowa City: University of Iowa, 1928. For a brief account,
see: Twenty-Ninth Yearbook of the National Society for the Study of Education, pages
611-619. Bloomington, Illinois: Public School Publishing Company, 1930.




true of the drill periods themselves, where the accuracy standards of the 
highest fourth of the experimental group exceeded those of their controls 
eight times out of eleven, whereas the lowest fourth of the experimental 
groups fell behind their controls on every drill. This experiment seems to 
have established rather definitely two important points: 


1. A knowledge of progress in learning under classroom conditions is likely to 
have much less effect than that under laboratory conditions. 

2. A knowledge of progress is likely to be more beneficial to good students than 
to poor. 


Studies since reported have confirmed the first point, but most of them 
have not been analyzed with respect to the second point. 

Forlano³² conducted a comprehensive series of experiments in grades
four to eight, inclusive, involving in all 1,294 pupils, and touching upon 
various aspects of the problem of the effect on learning of a knowledge of 
results. The experimenter emphasizes the fact that these studies were made 
"in the normal classroom situation . . . as far as possible as a part of the 
daily school routine." He attempted to determine whether giving a knowl- 
edge of results immediately after the word had been spelled or an arithmetic 
fact had been studied was more effective than when a knowledge of results 
was withheld until an entire column of 20 or 24 items had been attempted. 
In other words, if one may use the analogy of the target range, Forlano 
was interested in finding out, so to speak, whether it was better to tell the 
marksman his score after each shot or to wait until he had fired a series 
of 20 shots. The author's conclusion is as follows:³³


The results of our experiments show that there is a tendency for learning during 
which the learner ostensibly receives immediate knowledge of results to be less 
efficient than learning in which knowledge of results is delayed. In general, it may 
be said that this superiority of the "delayed knowledge of results" method does not 
always approach statistical certainty. 


Even this modest conclusion, the author suggests, is limited by the fact 
that the methods employed "may not be ‘pure’ methods of what they pur- 
port to involve,” and that the apparent superiority of the delayed pro- 
cedure may be due to other causes. In any event, since the period of delay 
never exceeded five minutes, little light is shed upon the ordinary school 
situation, where the tests follow learning after an interval ranging from a 
day to a year or longer. 

32 George Forlano, School Learning with Various Methods of Practice and Rewards,
pages 55-114. New York: Bureau of Publications, Teachers College, Columbia University,
1936.

33 Ibid., page 99.

34 Francis J. Brown, “Knowledge of Results as an Incentive in Schoolroom Practice,”
Journal of Educational Psychology, 23: 532-552, October, 1932.

Brown³⁴ reports an experiment in arithmetic in grades 5A and 7A. Both
his procedure and his conclusions differed somewhat from those of Panlasigui.
In grade 7A, Brown selected his experimental and control grades,
which were only roughly equivalent, on the basis of an intelligence test,
and in grade 5A on the basis of estimates of intelligence and achievement. 
The groups were reversed at the end of the first period of ten days. The 
drill period was eight to ten minutes daily. While the differences, on the 
whole, favored the experimental group, they were not very impressive. 
An examination of the individual drill periods reveals the fact that the 
progress from day to day in all groups was irregular and somewhat incon- 
sistent, and that the differences between experimental and control groups 
were generally less on the tenth day than on the first. There was some
evidence in Brown's study that the incentive was somewhat more effective 
with boys than with girls, but the outstanding fact was the remarkably 
small amount of influence, taken from any point of view, of a knowledge 
of progress in the classroom as compared with the laboratory.

Deputy³⁵ conducted in a state university a carefully planned experiment
with three groups of students, of approximately equal intelligence, in fresh- 
man philosophy, which met twice a week. For six weeks during the first 
half of the semester the first ten minutes of each class meeting of the control 
group were devoted to an oral review of the preceding lesson. One of the 
experimental groups had a ten-minute objective test covering the same 
material, and the other experimental group had the same items in a twenty- 
minute test given once a week. Beginning at the middle of the semester, 
the group which had served as a control was given the ten-minute test at 
each class meeting, while the other two groups had only the oral reviews. 
The scores for the experimental groups were put on the board following 
each test, and each student was urged to keep a record of his progress. 
Only one of the three comparisons between the ten-minute written test and 
the ten-minute oral review showed the former to be superior by a statis- 
tically significant amount. This fact the author ascribed to a particularly 
favorable attitude on the part of the students. The experimental group 
which excelled happened to be slightly the most intelligent of the three, 
and also showed itself superior to the group which took the twenty-minute 
test once a week. Deputy’s most significant conclusion was: “Considerable 
precaution should be taken in applying principles, derived from laboratory
and other non-classroom situations, to work in school subjects.” 

Two years later Ross³⁶ began a series of experiments which were to force
him to this same conclusion. Attention has already been called to the
earlier laboratory experiment³⁷ which had appeared convincing not only
to the author at the time but also to many readers since that time, judging 


35 E. C. Deputy, “Knowledge of Success as a Motivating Influence in College Work,”
Journal of Educational Research, 20: 327-334, December, 1929.

36 C. C. Ross, “The Influence upon Achievement of a Knowledge of Progress,”
Journal of Educational Psychology, 24: 609-619, November, 1933.

37 See Footnote 5.




from the writers on educational psychology who have quoted it with approval.

Upon the basis of a comprehensive examination given at the end of the 
first unit in a class in tests and measurements, a large class was divided 
into four substantially equivalent groups. A regular class test was given to 
all students once a week for the next two months. At the next class meeting 
following a test, a distribution for the entire class was put on the black- 
board and a brief discussion given of each item missed by any considerable 
number. But the four groups were given different degrees of information 
as to progress. One group was given no knowledge whatsoever as to its 
scores. A second group was given vague knowledge, each student being told 
merely that his score was “good,” “fair,” or “poor.” A third group was 
given partial knowledge, each student being told his point score but not 
allowed to see his paper. The fourth group was given full knowledge, each 
student being shown his paper at the close of the class and allowed to ask 
any questions he wished to ask regarding it. 

Figure 38 shows the results for the four groups in the form of cumulative 
scores, week by week, for the first eight weeks and for the last four weeks,


Figure 38. The Influence of Knowledge of Progress Upon Achievement in a College
Class. [Line graph: cumulative score plotted against test number, 1 to 12, with separate
curves for the Full Knowledge, Partial Knowledge, Vague Knowledge, and No Knowledge groups.]




when the groups were reversed. Nowhere was there a statistically signifi- 
cant difference between any of the groups. The experiment was repeated 
with two other classes in the same subject and with one in a different 
subject. Not content with this, Ross persuaded a colleague in another 
department to do the same experiment. In all, the groups involved more 
than 50 tests and about 300 students, and not once did there appear a 
difference, favoring the group with full knowledge of progress, that meets 
the minimum requirement for statistical significance. 

Two conclusions seem reasonably certain. The first, directly in line with 
that of Deputy, is as follows:³⁸

The Gestalt of the laboratory situation is so different from that of the life 
situation outside that it is hazardous to generalize from one to the other. One 
can never be certain what the outcome of a laboratory experiment will be when 
applied to the classroom situation until it has actually been tried out in that 
situation. 

The second conclusion is that most, if not all, experiments relating to 
knowledge of results in learning have involved another erroneous assump- 
tion: namely, that because students were not told their individual scores, 
they “had no knowledge of progress." Certainly they had their subjective 
impressions. To test out the accuracy of these impressions the author re- 
quested the students in the “no knowledge" group to estimate the scores 
they thought they had made when they turned in their papers at the close 
of the tests. The median coefficient of correlation between these estimates
and the actual scores was .71. Manifestly, then, such studies involve a 
comparison of two kinds of knowledge, subjective and objective. Moreover, 
there was a tendency for the poorer students to overestimate their scores. 
In such cases the illusion of success may very well have proved more stimulating
than the reality of failure.
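
A median coefficient of this kind can be illustrated with a brief sketch. The data below are hypothetical and are not taken from the study; the function and variable names are likewise assumptions. The sketch computes a product-moment correlation between estimated and actual scores for each of several tests and then takes the median of the coefficients.

```python
import math
from statistics import median


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data: (estimated scores, actual scores) for three weekly tests.
weekly_tests = [
    ([18, 22, 25, 20, 28], [12, 20, 24, 15, 29]),
    ([15, 19, 24, 21, 27], [14, 16, 25, 17, 30]),
    ([20, 20, 23, 26, 29], [18, 13, 22, 27, 28]),
]

# One coefficient per test, then the median across tests.
coefficients = [pearson_r(est, act) for est, act in weekly_tests]
median_r = median(coefficients)
print(round(median_r, 2))
```

The median is used here only because the text reports a median coefficient; with real data one coefficient would be computed for each weekly test in the same way.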

Knowledge of results combined with other incentives. It is prob- 
ably rare that a knowledge of progress operates alone. It is likely that 
such factors as rivalry and social recognition are always involved in some 
degree. But in the experiments so far reported, these other factors were not 
emphasized. In many experiments, however, the knowledge of progress has 
merely been taken as the occasion for utilizing other motives, such as praise 
and blame, rivalry, money or other rewards, and the like. 

At least two studies have attempted to use a knowledge of intelligence 
scores as an occasion for verbal suggestion and other forms of motivation. 
Mitchell³⁹ divided the lowest fourth of the freshman class in a high school
into two equivalent groups on the basis of the Otis tests. Each pupil in the 
experimental group received the following notice, without further com- 
ment: 

38 C. C. Ross, “A Needed Emphasis in Psychological Research,” Psychological Review,
43: 197-206, May, 1936.

39 Claude Mitchell, “Why Do Pupils Fail?” Junior-Senior High School Clearing House,
9: 172-176, November, 1934.




Dear Pupil: 


Your score on the Intelligence Test which was given at the opening of school 
is LOW. This will mean that much work and effort on your part will be necessary 
to keep up with the class. Put yourself to the task and show that you can do it.
YOU CAN IF YOU WISH. 


Principal 


At the end of the year it was found that 62 per cent of the group which had 
received this notice passed on all subjects, while only 15 per cent of the 
equally poor group, which had not been notified, did so. 

Ross⁴⁰ conducted a somewhat similar study with college students at the
University of Kentucky. From the lowest fifth in intelligence, experimental
and control groups of 40 freshmen each were formed upon the basis of
psychological tests, sex, and fraternity affiliation. The students in the experimental
group were then called together, and a frank statement was
made regarding their scores. They were told that it was important at the
outset to recognize the fact that they were up against a somewhat different
situation from that of the students with higher test scores. 'They were as- 


TABLE 33


POINT STANDING FOR THE FIRST AND SECOND SEMESTERS FOR LOW-RANKING
FRESHMEN WHO WERE TOLD THEIR INTELLIGENCE TEST SCORES
AS COMPARED WITH THOSE WHO WERE NOT


                      FIRST SEMESTER                        SECOND SEMESTER
POINT       Arts & Sci. | Commerce | Total       | Arts & Sci. | Commerce | Total
STANDING    Exp.* | Con.† | Exp. | Con. | Exp. | Con. | Exp. | Con. | Exp. | Con. | Exp. | Con.
1.80-1.99 .. 1 1 n 
1.60-1.79 .. 1 1 1 1 2 3 2 | 3 
1.40-1.59 ..| 3 1 3 1 1 1 1 2 1 
1.20-1.39 ..| 4 2 1 5 2 2 2 3 5 | 2 
100-119..| 7 ipia 9 2]|4 1 i 
.80- .99..| 2 5 1 2 3 7 1 6 1 1 2 7 
.60- .79..| 4 5 3 3 1 8 3 3 5 ] 8 4 
40- 59..| 3 3 5 2 8 5 3 1 z 6 5 th 
.20- 39 3 2 7 2 |10 1 4 3 1 7 
00712197. |n 2 2 1 4 3 1 1 2 4 3 
N .........  24 | 24 | 16 | 16 | 40 | 40 | 20 | 21 | 14 | 13 | 34 | 34
Mean......  .98 | .78 | .83 | .45 | .94 | .64 | .85 | .88 | .86 | .44 | .85 | .69
S.D. .....  .36 | .41 | .46 | .25 | .41 | .39 | .50 | .49 | .45 | .22 | .45 | .46
M(Exp.) - M(Con.)   .20 | .38 | .30 | -.03 | .42 | .16
ЕО TUE: (TO NUT T MTS Т Y ELIO ВОЈ 


* Experimental group. 
† Control group.


40 C. C. Ross, “Should Low-Ranking College Freshmen Be Told Their Scores on
Intelligence Tests?” School and Society, 47: 678-680, May 21, 1938.




sured, however, that the experience at the University showed that such 
students could succeed, if they were willing to work and did not attempt 
too heavy a load in school or too many activities outside. The control group 
had no advance information. 

The record of the two groups is summarized in Table 33. The mean point 
standing of the experimental group was .94 for the first semester and .85 for 
the second semester, while the corresponding values for the control group 
were .64 and .69, respectively. During the first semester three times as 
many students in the experimental group as in the control group made a
point standing of 1.00 or better, and more than twice as many made this 
standing the second semester. Approximately twice as many experimental 
as control students passed all subjects. On the whole, the difference was 
more marked for the first semester than for the second, and was decidedly 
greater for the College of Commerce than for the College of Arts and 
Sciences. These two studies offer rather convincing evidence that knowledge 
of intelligence test results may have a motivating effect on low-ranking 
freshmen in high school and college. More recent studies have tended to 
confirm these findings.⁴¹
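
The size of the total differences in Table 33 relative to their sampling error can be gauged with a rough sketch. The numbers of cases, means, and standard deviations below are the "Total" columns as read from the table; the function name and the treatment of the two groups as independent samples are assumptions made for illustration. Since the groups were matched, the figures obtained this way are approximate at best.

```python
import math

# "Total" columns as read from Table 33: (n, mean, standard deviation).
table33_totals = {
    "first semester": {"exp": (40, 0.94, 0.41), "con": (40, 0.64, 0.39)},
    "second semester": {"exp": (34, 0.85, 0.45), "con": (34, 0.69, 0.46)},
}


def critical_ratio(exp, con):
    """Return (difference, standard error, ratio) for two group means.

    Assumes independent groups, which is only an approximation here.
    """
    n_e, mean_e, sd_e = exp
    n_c, mean_c, sd_c = con
    diff = mean_e - mean_c
    se = math.sqrt(sd_e ** 2 / n_e + sd_c ** 2 / n_c)
    return diff, se, diff / se


for semester, groups in table33_totals.items():
    diff, se, ratio = critical_ratio(groups["exp"], groups["con"])
    print(f"{semester}: difference = {diff:.2f}, SE = {se:.3f}, ratio = {ratio:.2f}")
```

On these assumptions the ratio is considerably larger for the first semester than for the second, which agrees with the statement that the difference was more marked in the first semester.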

A great many more studies have utilized achievement test scores as occasions
for various types of motivation. In an early study Book and Norvell⁴²
used a knowledge of results in four laboratory experiments as a basis
for building morale or developing the “will to learn.” For example, students 
in the experimental groups “were frequently told that if they would only 
make up their minds to increase their score they would somehow find a way 
to do it," while at the same time the “method of measuring their output 
and having them keep track of their score usually convinced them that this 
was true." Their data support the conclusion that this “special group of 
incentives" help the experimental group to “make more improvement with 
a given amount of practice than do the control groups." But it is impossible 
to tell just how important a knowledge of results by itself would have been. 

An experiment by Hurlock,⁴³ which utilized test results as occasions for
praise and reproof, attracted considerable attention. The subjects were 
106 children in fourth- and sixth-grade arithmetic. The groups were equated 
on an initial practice period of fifteen minutes. Four more practice periods 
were held on successive days. The control group received the tests without 
comment. The praised group had their names read aloud at the beginning 
of each practice period. They were then called to the front of the room and 

41 Cf. R. K. Compton, “Student Evaluation of Knowing College Aptitude Test Score,”
Journal of Educational Psychology, 32: 656-664, December, 1941.

Edna E. Lampson, “How Objective Can Freshmen in College Be toward Objective
Evidence of Their Ability and Achievement?” Educational Administration and Supervision,
28: 280-290, April, 1942.

42 William F. Book and Lee Norvell, “The Will to Learn: An Experimental Study of
Incentives in Learning,” Pedagogical Seminary, 29: 305-362, December, 1922.

43 Elizabeth B. Hurlock, “An Evaluation of Certain Incentives Used in School Work,”
Journal of Educational Psychology, 16: 145-159, March, 1925.




received praise combined with exhortation to do still better work. Then the 
names of the children in the reproved group were called, and they were 
severely reproved for poor work, carelessness, and general inferiority. The 
ignored group heard what was said to the others, but they received no 
recognition whatsoever. The results are shown in Figure 39. After the first 


Figure 39. A Study of the Influence of Praise and Reproof upon Achievement in
Fourth-Grade and Sixth-Grade Arithmetic. (After Hurlock.) [Line graph: average score
plotted against days, 1 to 5.]


day reproof seemed far less effective than praise, although somewhat better 
than being ignored altogether. The control group made no progress what- 
soever. It is to be regretted that this experiment was not continued for 
several days longer. Manifestly an hour’s total working time is insufficient 
to establish fully the comparative merits of these incentives as they would 
operate day after day in the ordinary classroom. 

In a somewhat similar experiment in the same grades, Hurlock⁴⁴ studied
the effect of group rivalry on addition. The control group took their tests
for ten minutes on four days without comment. The experimental group 
was divided into two equivalent subgroups which were pitted against each 
other. The author emphasized the fact every day that the two groups “were 
absolutely equal, and that one had as much chance to win as the other.” 
Although the effect of rivalry was present in all types of pupils, it was most 
marked in younger pupils and in inferior pupils. Increase in accuracy was 
much less than increase in speed, with some tendency for increase in speed 


44 Elizabeth B. Hurlock, “The Use of Group Rivalry as an Incentive," Journal of
Abnormal and Social Psychology, 22: 278-290, October-December, 1927. 




to be accompanied by reduction of accuracy. It is well to keep in mind 
Thorndike’s warning that “the attainment of active rather than passive 
learning at the cost of practice in error may often be a bad bargain."⁴⁵

Another study⁴⁶ shows that repeated applications of praise or blame may
have different effects on introverted and extroverted pupils. Introverted 
fifth-grade pupils improved faster in number cancellation exercises when 
praised than did either introverts who were blamed or extroverts who were 
praised. However, extroverted pupils when blamed improved faster than 
extroverts who were praised or introverts who were blamed. Unfortunately 
one cannot safely conclude from a study which involved a total practice 
time of three minutes upon highly artificial tasks that the same differences 
would necessarily appear under ordinary school conditions. The problem 
is worthy of further experimentation. 


II. The Relation of Measurement to the Type of Learning 


Closely related to the amount and quality of learning is the type of learn- 
ing, or the learning procedure which is employed. There is considerable 
evidence for thinking that effective work or study habits of the student are 
of fundamental importance in learning. A question of major importance, 
therefore, is: to what extent does the type of measurement used influence 
the type of study technique employed by the student? Some important 
studies bearing on this question have been conducted on the college level. 

In a pioneer study, Terry⁴⁷ found that 236 students in educational psychology
were “influenced to a significant extent by the type of examination
for which they were preparing.” The most striking characteristic of the
methods employed in preparing for an objective test which had been an- 
nounced a month in advance was the students’ emphasis on details, while 
they tended to study for large units of subject matter when they were 
preparing for an essay examination announced for the next month. Douglass
and Tallmadge⁴⁸ reported similar results at the University of Minnesota.
They found that the “objective type focuses attention upon details
and exact wording, while the subjective type apparently favors methods
involving organization, perceiving relationships and trends, and personal 
reactions." 

There also appear to be significant differences among the various forms 
of the so-called new-type examinations in their effect on study methods. 


45 Edward L. Thorndike and others, The Psychology of Wants, Interests, and Attitudes.
New York: D. Appleton-Century Company, 1935. Page 147.

46 George G. Thompson and Clarence W. Hunnicutt, “The Effect of Repeated Praise
or Blame on the Work Achievement of ‘Introverts’ and ‘Extroverts’,” Journal of Educational
Psychology, 35: 257-266, May, 1944.

47 Paul W. Terry, “How Students Review for Objective and Essay Tests,” Elementary
School Journal, 33: 592-603, April, 1933.

48 Harl R. Douglass and Margaret Tallmadge, “How University Students Prepare
for New Types of Examinations,” School and Society, 39: 318-320, March 10, 1934.



Terry⁴⁹ found, for example, that the one predominant method of preparing
for completion tests emphasized the word-for-word mastery of statements 
considered important, while preparing for true-false tests involved methods 
which dealt primarily with definitions and detailed facts such as the authors 
and findings of experiments. The author's conclusion points out an impor- 
tant educational implication: 


The kind of test to be given, if the students know it in advance, determines in 
large measure both what and how they study. The behavior of students in this 
habitual way places greater powers in the teacher's hands than many realize. By 
the selection of suitable types of tests the teacher can cause large numbers of his 
students to study, to a considerable extent at least, in the ways he deems best for a 
given unit of subject-matter. 


Meyer⁵⁰ conducted a careful laboratory experiment with 124 psychology
students to determine the relation between the specific examination-set and 
immediate memory and delayed memory after five weeks. When the amount 
of study was held constant, the method and results appeared to be largely 
dependent on whether the set was for recall or for recognition tests. It ap- 
peared that when students expected completion tests they studied with 
more effort than they would have put forth for recognition tests. More 
students made summaries and maps and otherwise attempted to obtain a 
general picture of the material when they expected essay examinations than 
otherwise. Meyer points out four practical implications: 


1. Since it is more economical, when a given amount of time is spent in study- 
ing, to use a recall examination set for delayed recognition or immediate and 
delayed recall tests, recognition questions should be used in testing only when 
they form a part of the entire examination or when students are unaware that 
such questions are to be used exclusively. 

2. If the teacher feels it necessary that the students be able to recognize certain 
materials for a short time only, then the indications are that a recognition exami- 
nation set may be used. This means that the teacher must evaluate the material 
in his course very carefully, since recognition tests, if given indiscriminately, may 
have a deleterious effect on what the students ultimately retain of the course. 

3. If the teacher feels it necessary that the students be able to recall isolated
facts when specific cues are given as to the fact wanted, a completion examination
set may be used with profit.

4. If the teacher wants the students to recall the material in an organized fashion 
and to know facts when cues are not given, the essay examination set should be 
used in preference to any objective type of examination set. Here again the teacher 
must evaluate the material which he presents in the light of what the student 
should learn from the course.


49 Paul W. Terry, “How Students Study for Three Types of Objective Tests,” Journal
of Educational Research, 27: 333-343, January, 1934.

50 George Meyer, “An Experimental Study of the Old and New Types of Examination,”
Journal of Educational Psychology, 25: 641-661, December, 1934; and 26: 30-40,
January, 1935.




The following quotation from Monroe⁵¹ suggests that the nature of the
examinations emphasized by the teachers may influence the students’ re- 
actions much more than the objectives of the course: 


There has been much discussion of the importance of teachers formulating their 
objectives and, in response to the pressure of authority, they have spent many 
hours in formulating lists of immediate objectives, that is, the goals toward which 
students are expected to direct their efforts. Many of these lists merit commenda- 
tion, but their influence upon students is practically nil in comparison with the 
influence of the tests administered. Students direct their efforts toward becoming 
able to respond to the tests they anticipate. 


D. Some Educational Implications of Motivation Studies 


Much of the experimental evidence on motivation has been fragmentary, 
some of it contradictory, and hardly any of it conclusive. But a few gen- 
eralizations appear to have been fairly well established. 

Implications for educational theory. In the first place, there is grave 
danger of premature and unwarranted generalizations in psychology and 
education. That it is hazardous to generalize from the laboratory experi- 
ment to the classroom application has been demonstrated in motivation 
experiments again and again. It is also dangerous to generalize from one 
age level to another. This is one of the greatest limitations of much of the 
experimental work on motivation. There is a great need for comparing 
the results of experiments made on the college level with results obtained 
from pupils on the elementary and secondary school levels. 

In the second place, there are no fixed motivating categories such as 
knowledge of results, praise and blame, rewards and punishments, et cetera. 
Brenner⁵² states this point well:


The truth seems to be that there do not exist such psychological entities but
that they do act in specific situations, depending upon all the factors of the situation
as a whole. What in one situation may constitute praise, under certain other circumstances
will be considered blame. The incentives derive their attributes, so to
speak, from the situation in which they are active. 


Implications for educational practice. Three points require brief 
mention. In the first place, the measurement program of the school influ- 
ences both the teacher and the learner. It affects teaching emphasis and 
curriculum content, as well as the amount and quality of learning and the 
procedure employed. In the second place, no motivating factor operates 


51 Walter S. Monroe, “Some Trends in Educational Measurement,” Twenty-Fourth
Annual Conference on Educational Measurements, page 32. Bulletin of the School of
Education, Indiana University, Vol. XIII, No. 4. Bloomington, Indiana: Bureau of Cooperative
Research, 1937.

52 Benjamin Brenner, Effect of Immediate and Delayed Praise and Blame upon Learning
and Recall, pages 48-49. New York: Bureau of Publications, Teachers College, Columbia
University, 1934.




universally. Both Chase and Hurlock, for example, found young children 
more susceptible than older children to the motivation used. In general, 
praise seems more effective with the duller and socially inferior groups. 
Frequent testing also seems most helpful to weaker pupils. On the other 
hand, there is some evidence that blame and knowledge of results are more 
effective in the stronger groups. Even in similar age and social groups, 
however, marked individual differences appear as to the relative effective- 
ness of different types of motives, or even as to the effectiveness of the same 
motive used at different times. Brenner⁵³ warns against a

stereotyped habit of motivation, for instance, always praising the children, always 
smiling and appearing pleased. This form of mechanized motivation is not adequate 


for increasing the performance of children, and it is doubtless harmful in its in- 
fluence upon character building in children. 


In the third place, no motivating factor operates automatically. Test 
scores, at best, merely provide an occasion for praise or blame, reward or
punishment, or some form of social recognition. The strategic place of the
teacher is nowhere more in evidence than in motivation. In a fundamental 
sense, the role of the teacher is to stimulate and guide the learning process. 
Perhaps Brenner's concluding statement⁵⁴ does not put the matter too
strongly:

The facts about the usefulness of a motive in a certain learning situation will be 
furnished by educational psychology, but proper application of the incentive in a


given situation depends upon the insight of the teacher. The effectiveness or worth 
of a teacher depends upon his ability to make adequate use of motivation. 


E. Practice Effect 


The whole question of practice is intimately related to learning in general.
One special aspect, practice effect on repeated tests, has received considerable
attention. Several standardized tests, such as the American
Council on Education Psychological Examination,⁵⁵ contain short pretests
which help the examinee “warm up” and acquire the proper set for the
subtest that follows. Some examiners prefer to preface a testing session with
easy practice material to “cushion” inexperienced testees and thereby put
them more nearly on an equal footing with “test-wise” students. Even the
effects of coaching on highly similar material may be overestimated, however,
as Dyer⁵⁷ has demonstrated quite well with preparatory-school students.

53 Ibid., page 50. 

54 Ibid., page 50.

55 Both high-school and college-freshman forms are published by the Cooperative
Test Division of Educational Testing Service.

56 For a surprising by-product of this procedure, see Scarvia B. Anderson, “Prediction
and Practice Tests at the College Level,” Journal of Applied Psychology, 37: 256-259,
August, 1953.

57 Henry S. Dyer, “Does Coaching Help?” College Board Review, No. 19: 331-335,
February, 1953.




Nevertheless, it is undoubtedly true that the individual who takes a 
standardized examination for the first time in competition with experienced 
examinees is handicapped, especially if the test involves speed, complexity, 
and novelty. 


SELECTED REFERENCES FOR FURTHER READING 


Cane, V. R., and Heim, Alice W., “The Effects of Repeated Retesting: III. Further 
Experiments and General Conclusions,” Quarterly Journal of Experimental Psy- 
chology, 2: 182-197, November, 1950. 

Cook, Walter W., “The Functions of Measurement in the Facilitation of Learning,” 
Chapter 1 in E. F. Lindquist (Editor), Educational Measurement. Washington, 
D. C.: American Council on Education, 1951. 

Current Theory and Research in Motivation—a Symposium. Lincoln: University of 
Nebraska Press, 1953. 193 pages. Articles and comments by Judson S. Brown,
Harry F. Harlow, O. Hobart Mowrer, Theodore M. Newcomb, Vincent Nowlis, 
and Leo J. Postman. 

Hilgard, Ernest R., Theories of Learning. New York: Appleton-Century-Crofts,
Inc., 1948. 409 pages. 

Hilgard, Ernest R., and Russell, David H., “Motivation in School Learning,”
Chapter II in “Learning and Instruction,” Forty-Ninth Yearbook of the National
Society for the Study of Education, Part 1. Chicago: University of Chicago Press, 
1950. 

MacKinnon, Donald W., “Fact and Fancy in Personality Research,” American 
Psychologist, 8: 138-146, April, 1953. 

McClelland, David C., “The Measurement of Human Motivation: An Experimental
Approach,” pages 41-56 in Proceedings of the 1952 Invitational Conference
on Testing Problems. Princeton, New Jersey: Educational Testing Service, 1953. 

McGeoch, John A., and Irion, Arthur L., The Psychology of Human Learning 
(Second Edition). New York: Longmans, Green and Company, 1952. Chapter VI, 
“Learning as a Function of Motive-Incentive Conditions.” 

National Council of Teachers of Mathematics, The Learning of Mathematics, Its
Theory and Practice, Twenty-First Yearbook. Washington, D. C.: The Council,
1953. Chapter II by Maurice L. Hartung, “Motivation for Education in Mathematics,”
and Chapter VI by Ben A. Sueltz, “Drill—Practice—Recurring Experience.”

Peel, E. A., “A Note on Practice Effects in Intelligence Tests,” British Journal of
Educational Psychology, 21: 122-125, June, 1951.

Peel, E. A., “Practice Effects between Three Consecutive Tests of Intelligence,”
British Journal of Educational Psychology, 22: 190-199, November, 1952.

Pressey, S. L., “Development and Appraisal of Devices Providing Immediate
Automatic Scoring of Objective Tests and Concomitant Self-Instruction,”
Journal of Psychology, 29: 417-447, April, 1950.

Tyler, Ralph W., “The Functions of Measurement in Improving Instruction,”
Chapter 2 in E. F. Lindquist (Editor), Educational Measurement. Washington,
D. C.: American Council on Education, 1951.


12 


Diagnosis 


A. The Problem of Diagnosis in Education 


The nature of educational diagnosis. Educational diagnosis seeks 
to determine the nature and causes of unsatisfactory adjustment to the 
school situation. It is concerned with the specific weaknesses of individual 
pupils. Diagnosis seeks not so much to describe or explain educational 
maladjustment as to correct or prevent it. Adequate diagnosis is the basis 
of intelligent guidance and of effective teaching. 

Education borrowed the term “diagnosis” from medicine, where its fun- 
damental character has been long recognized. Medical diagnosis commonly 
starts with some bodily symptom, such as pain or abnormal temperature. 
The next step is to determine the causes that lie behind the symptoms. 
The trouble may be the malfunctioning of some organ or gland, which in 
turn may be caused by some particular germ or toxic condition, and which, 
when located, may yield readily to the appropriate medical treatment or 
surgery. The order of events is clearly indicated by the rule: “Before you 
dose, diagnose!" 

The situation in education is much the same, although here the scope 
of diagnosis is usually broader. At times educational difficulties can be 
traced to some organic defect, such as imperfect vision or hearing, or some 
glandular disorder, but educational diagnosis is more often concerned with 
functional disorders rather than organic. Pupils who are perfectly normal 
organically may experience great difficulty with various aspects of the 
school situation. It is a matter of common knowledge that many serious 
learning difficulties arise, not so much from structural defects as from other 
factors, such as faulty habit-formation, lack of interest, or a poor home 
environment. Despite these complications an outstanding educator has asserted
that “experts in reading, arithmetic, and spelling can now make
diagnoses no less valid and reliable than are most diagnoses in medicine.”¹

Furthermore, the learning process at any time is usually conditioned by 
many factors, both inside and outside the learner. It is rarely possible to 
isolate a single causative factor analogous to the disease germ in medicine, 
but the various factors may be classified roughly as follows: 

1. Internal factors:

a. Physical: sensory equipment, glandular balance, health status, stage of
maturity level, etc.

b. Intellectual: general intelligence, specific talents and deficiencies, etc.

c. Emotional: attitudes, interests, drives, prejudices, feelings of inadequacy,
etc.

d. Educational: background, work-habits, etc.

2. External factors:

a. School environment: educational program, teacher, playmates, equipment,
etc.

b. Extraschool environment: home, community, church, recreational facilities,
etc.

The scope of educational diagnosis has also increased to keep pace with 
the growing concept of education. When the conventional school conceived 
of its function rather narrowly in terms of certain academic knowledge and 
skills, the scope of diagnosis was likewise limited. Now that the modern 
school has enlarged the concept of education to make it synonymous with 
the growth of personality, it is no longer proper to limit the scope of diag- 
nosis to locating the causes that interfere with the ordinary academic prog- 
ress of the pupil. The learning difficulties presented by the school curricu- 
lum will doubtless always constitute an important part of any program of 
diagnosis. In fact, this phase of diagnosis naturally increases in scope and 
importance as the objectives of the various school subjects are extended 
to include the less tangible outcomes, such as attitudes, interests, apprecia- 
tions, tastes, and standards of judgment. But some of the most important 
and difficult aspects of diagnosis have to do with social adjustments and 
personality disorders of many kinds. 

1 William A. Brownell, “Quantitative Research on Learning and Teaching,” School
and Society, 50: 851, December 30, 1939.

It is likewise apparent that the scope of diagnosis is much larger than
the use of tests and examinations. This, of course, does not mean that tests
have an unimportant place in educational diagnosis. On the contrary, an
adequate diagnosis may involve the use of intelligence tests, both general
and specific, and of diagnostic achievement tests, both standardized and
teacher-made, as well as the use of various pieces of laboratory apparatus
for measuring sensory acuity, co-ordination, and the like. In addition to
many kinds of tests, reliance must be placed upon other forms of appraisal,
such as rating scales, uncontrolled observation, questionnaires, and interviews.
Important as are the ordinary forms of measurement in diagnosis,
they are often by themselves insufficient. Keys has well stated the role of
intelligence tests in diagnosis:²


Few psychologists today look to an individual's score on an intelligence test, 
alone and of itself, to determine the source of his difficulties or indicate the exact 
solution to his problems. It is entirely probable, however, that the outcome of such 
a test, judiciously chosen and competently administered, will contribute as much 
if not more to sound clinical appraisal than any other single fact obtainable. 
Properly supplemented with other diagnostic procedures, the information thus 
derived is virtually indispensable to intelligent attack upon a wide variety of 
problems. 


The importance of educational guidance in the modern school arises from 
two facts: (1) many pupils make unsatisfactory progress in school—some 
fail altogether and others achieve little; and (2) few causes of maladjust- 
ment lie on the surface or are self-evident. 

It should be noted that up to the present time most tests designed spe- 
cifically for diagnostic purposes have been for the elementary school. As 
long as the secondary school and college had highly selected student bodies, 
their need for diagnostic tools was less acute. In recent years the enlarged 
enrollments at these higher levels of education have greatly increased the 
need for diagnosis. 

The value of diagnosis in education. There is an abundance of experi- 
mental evidence to show the value of educational diagnosis combined with 
the appropriate remedial measures. Such evidence is available on all levels 
of instruction and in a variety of subjects. Science has added confirmation 
to the verdict of common sense: it really helps to “put the oil where the 
squeak is.” For example, Baker³ found that four months’ special coaching
of sixty nine-year-old pupils from seven Detroit schools resulted in a gain 
of about seven months in educational age. The coaching consisted of two 
thirty-minute periods per week devoted to the subject or subjects in which 
the pupil had shown weaknesses. Scruggs⁴ compared the improvement of
two equivalent classes of fifth-grade Negro children in Kansas City, one 
of which had the ordinary group instruction in handwriting and the other 
an equal amount of corrective practice based upon a detailed analysis of 
the weaknesses of each pupil. In seven weeks the second group increased 
the average quality of its handwriting about twice as much as the first. 
In a similar study Guiler⁵ found that fourteen seventh-grade pupils made
in three months a normal gain of three years in quality of handwriting.
Blair⁶ has summarized studies in the tool subjects on the secondary level
which show similar results.

2 Noel Keys, “Applications of Intelligence Testing,” Review of Educational Research,
8: 256, June, 1938.

3 Harry J. Baker, Educational Disability and Case Studies in Remedial Teaching,
page 65. Bloomington, Illinois: Public School Publishing Company, 1929.

4 Sherman D. Scruggs, “Remedial Teaching for Improvement in Handwriting,” Journal
of Educational Research, 23: 288-295, April, 1931.

5 Walter Scribner Guiler, “Improving Handwriting Ability,” Elementary School Journal,
30: 56-62, September, 1929.
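The gains reported by Baker and by Guiler can be put on a common footing by dividing months of measured growth by months of instruction. The short sketch below does only that arithmetic with the figures quoted in the text; the function name `gain_ratio` is introduced here for illustration and does not come from either study.

```python
def gain_ratio(months_gained: float, months_elapsed: float) -> float:
    """Months of measured growth per month of instruction."""
    if months_elapsed <= 0:
        raise ValueError("months_elapsed must be positive")
    return months_gained / months_elapsed


# Figures as quoted in the text.
baker = gain_ratio(months_gained=7, months_elapsed=4)        # about 7 months in 4
guiler = gain_ratio(months_gained=3 * 12, months_elapsed=3)  # 3 years in 3 months

print(f"Baker:  {baker:.2f} months gained per month of coaching")
print(f"Guiler: {guiler:.2f} months gained per month of coaching")
```

A ratio of 1.0 would represent normal progress, so both studies report growth well above the ordinary rate, though the two ratios rest on different measures (educational age and handwriting quality) and are not directly comparable.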

It has been shown that the value of such remedial measures is by no 
means confined to skill subjects, such as handwriting and spelling. For 
example, a study by Leonard⁷ showed that junior-high-school pupils improved
more rapidly in the ability to write compositions free from common
errors in capitalization and punctuation during a program involving error 
analysis and appropriate remedial exercises than did pupils of like ability 
exposed to the conventional method of teaching. While both groups showed 
definite improvement, the mean decrease in the twenty-eight most frequent 
errors, after eleven forty-five-minute practice periods, was approximately 
twice as great for the experimental as for the control groups. Experiments 
by Guiler⁸ on the elementary-school, the senior-high-school, and the college
levels showed comparable results from similar methods. 

Stone⁹ found that pupils in the fifth and sixth grades in twenty-three
schools, who devoted not more than forty minutes a day for five weeks to 
diagnostic and practice tests, gained two to six times as much in ability 
to solve reasoning problems as did pupils of equal ability who had only the 
regular arithmetic work in school. Furthermore, the results of the study 
indicated that the superior gain in reasoning ability resulting from this 
diagnostic and remedial program was about twice as great for pupils in the 
highest sixth in intelligence as for those in the lowest sixth, that the gain 
transferred to problems of a different content, and that it persisted for at 
least a year, at the end of which the retests were given. 

The psychology of such procedures seems reasonably clear. It is a sound 
principle of teaching which holds that learning always begins where the 
learner’s present knowledge leaves off. Failure to observe this principle 
results in foolish attempts to do two impossible things. One of these is 
attempting to teach a pupil what he already knows. The other is attempt- 
ing to teach him on a level too far beyond his present knowledge. Both are 
equally futile. The only adequate safeguard to be obtained is in frequent 
check-ups on the pupil’s progress. 


6 Glenn Myers Blair, Diagnostic and Remedial Teaching in Secondary Schools, 422
pages. New York: The Macmillan Company, 1946.

7 J. Paul Leonard, “The Use of Practice Exercises in Teaching Capitalization and
Punctuation,” Journal of Educational Research, 21: 186-190, March, 1930.

8 Walter S. Guiler: “Improving Instruction in English Mechanics in the Elementary
School,” Elementary School Journal, 34: 427-437, February, 1934; “Improvement and
Permanency of Learning Resulting from Remedial Instruction,” School Review, 41:
450-458, June, 1933; and “Remediation of College Freshmen in Sentence Structure,”
Journal of Educational Research, 26: 110-115, October, 1932.

9 C. W. Stone, “An Experimental Study in Improving Ability to Reason in Arithmetic,”
Twenty-Ninth Yearbook of the National Society for the Study of Education, pages
589-599. Bloomington, Illinois: Public School Publishing Company, 1930.




B. The Techniques of Diagnosis 


The levels of diagnosis. The process of educational diagnosis may be
profitably thought of as falling into five steps, or levels. Figure 40 is a
graphical representation of the process. It will be noted from the questions
asked at each level that the first four steps (the W's) have to do with
corrective diagnosis, while the highest level has to do with what may be
termed preventive diagnosis. In other words, the immediate purpose is
correction, but the ultimate purpose is prevention.

Figure 40. The Five Levels of Educational Diagnosis.

Locating the individuals needing diagnosis. How can we best locate
the pupils not making satisfactory adjustment to the school situation? 
This is logically the problem with which the program of educational diag- 
nosis begins. The order of events is not unlike that described in the famous 
recipe for making rabbit stew which begins: “First you catch your rabbit.” 
Strictly speaking, however, while it is a necessary preliminary step, it is 
hardly a part of the actual process of diagnosis. 

Various ways of locating the individuals who require diagnostic study 
have been used. Survey and group intelligence tests are often employed to 
screen those whose achievement is unsatisfactory. Using this method Wilson¹⁰
found that about 70 per cent of the pupils in the seventh and eighth
grades of fifteen representative cities and towns in the metropolitan area 
of Boston needed corrective instruction in the fundamental arithmetic 
processes. 

Several writers suggest that any pupil whose level of achievement is well 
below his level of intelligence is worthy of special study. Others contend 
that a practical difficulty with the procedure is that tests of achievement 
and so-called tests of intelligence really largely measure the same thing, 
and suggest instead that diagnostic study be given to those pupils whose 
achievement in some school subject, or subjects, is well below their general 
achievement level. Still other writers rely heavily upon the judgment of the
teachers. Baker,¹¹ for example, selected his sixty pupils for special remedial
coaching by taking those who had received final marks of failure or conditional
passing in four fundamental subjects. He admits that this criterion
was used at the outset primarily because of its availability, but states that
it “arose steadily in our esteem.”

10 See Guy M. Wilson, “Corrective Load in the Fundamentals of Arithmetic in
Grades VI, VII, and VIII,” pages 234-241 in The Role of Research in Educational
Progress. Washington, D. C.: American Educational Research Association, May, 1937.

All these suggestions have merit. The judgment of the present teacher 
should always be taken into account, especially since in the ordinary school 
whatever diagnostic and remedial work is attempted will be undertaken
by the regular classroom teacher. But the present teacher's judgment needs 
to be supplemented by considering the judgment of past teachers as re- 
flected in the school record. Since the judgment of teachers is not infallible, 
however, general achievement tests and intelligence tests will be found 
particularly valuable. Any pupils in the intermediate grades whose achieve- 
ment falls a year or more below their age or grade level should usually 
merit some study. Discrepancies between achievement and intelligence are 
of particular significance when intelligence has been measured by individual 
tests or performance tests rather than by ordinary group tests. Such dis- 
crepancies also assume added significance when the pupil has apparently 
had ample opportunity for learning. 
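The screening rule suggested for the intermediate grades (achievement a year or more below age level) can be sketched as a simple filter. The pupil records and the names `Pupil` and `needs_study` below are hypothetical, invented only to show the comparison; ages are kept in whole months so that the one-year threshold is an exact integer test.

```python
from dataclasses import dataclass


@dataclass
class Pupil:
    name: str
    age_months: int              # chronological age, in months
    achievement_age_months: int  # educational age from a general achievement test


def needs_study(pupil: Pupil, lag_months: int = 12) -> bool:
    """True when achievement trails chronological age by at least lag_months."""
    return pupil.age_months - pupil.achievement_age_months >= lag_months


# Hypothetical records, invented for illustration only.
pupils = [
    Pupil("Pupil A", 126, 130),
    Pupil("Pupil B", 128, 116),
    Pupil("Pupil C", 124, 111),
    Pupil("Pupil D", 127, 120),
]

flagged = [p.name for p in pupils if needs_study(p)]
print(flagged)
```

Such a filter only nominates pupils for study. As the text stresses, teacher judgment, the school record, and the pupil's opportunity to learn still have to be weighed before any conclusion is drawn.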

While special study and treatment are often justified for the lowest 5 or 
10 per cent in the typical class, it must not be thought that diagnosis should 
be restricted to low-ranking pupils and to obvious misfits. On the contrary, 
some of the most profitable cases are those whose achievement is average 
or even above, but is nevertheless well below what appears possible. As a 
matter of fact, Hildreth¹² points out that many clinics prefer not to attempt
remedial work with very dull pupils, say those with IQ's of approximately
80 and below, but prefer instead to alter the achievement goals for
such children. It will be found at times that pupils whose personality de- 
fects interfere with satisfactory social adjustment have superior academic 
achievement. In fact, psychiatrists point out that the teacher should often 
be most concerned about the mental health of those who give her least 
concern academically. The writer recalls the case of a sixth-grade girl whose 
scholastic achievement was well above the norms on the tests but whose 
attempts at social adjustment to the group had been distinctly unsuccess- 
ful. The girl told her mother that she would give anything in the world 
if she had just one friend. In the conventional school this girl would have 
been regarded as making an entirely satisfactory record, but in the modern 
school she is seen to be so seriously maladjusted as to require special 
attention.

11 Harry J. Baker, op. cit., pages 9-16.

12 Gertrude Hildreth, Learning the Three R's (Second Edition), page 545. Philadelphia:
Educational Publishers, Inc., 1947.

Locating the nature of the difficulty. After locating the pupils who
are experiencing trouble, the next step is to make a careful examination
of the difficulty of each pupil. A bill of particulars is needed. It is just here
that diagnostic tests, if available, are of great value. The aim of such tests 
is to reveal the specific location of the pupil's difficulties. As a rule, each 
test has a limited scope, but attempts to explore thoroughly this restricted 
area. For example, one test might undertake to find the particular number 
combinations which are causing trouble in the addition of whole numbers, 
while another test attempts to find out whether inadequate reading ability, 
faulty technique of analysis and procedure, lack of skill in the fundamental 
processes, or some other factor is responsible for poor performance in rea- 
soning problems. 

Most of the diagnostic tests published to date are limited to the tool
subjects mainly on the elementary level. Traxler¹³ prepared a comprehensive
bibliography of available tests together with a practical discussion of
their effective use. Blair¹⁴ compiled similar information with special reference
to the high school. Traxler¹⁵ offered this warning: “Our experience
at the Educational Records Bureau indicates that, at present, there is
scarcely one test which gives us as much reliable information as is needed
for effective diagnosis in any one field.”

But any test, whether standardized or not, can be used to reveal the 
location of errors. The principal advantages of the standardized test are 
that in content it is likely to represent a more careful selection than the 
informal test, and that the existence of comparable forms makes it possible 
to verify the accuracy of diagnosis based on one form and to check upon 
the success of any remedial measures undertaken. However, these special 
values in standardized tests by no means rule out the values of informal 
tests when used for diagnostic purposes. In reading, for example, some 
writers regard informal tests as even more important than standardized 
tests. The diagnostic value to be realized depends more upon the teacher 
than upon the test used. Durrell¹⁶ estimates that at least 75 per cent of the
cases requiring special attention in reading can be handled adequately by 
well-trained classroom teachers using non-standardized tests supplemented 
by observation of the pupils' achievement and work habits. He says: 


Such informal tests and observation charts usually indicate the correct level on 
which to start remedial instruction, the specific reading abilities in which the child 
is weak, and the faulty habits and confusions which must be overcome in the 
remedial program.¹⁷


13 Arthur E. Traxler, The Use of Test Results in Diagnosis and Instruction in the Tool
Subjects, 80 pages. New York: Educational Records Bureau, 1942.

14 Glenn Myers Blair, op. cit.

15 Arthur E. Traxler, “Individual Evaluation,” in New Directions for Measurement
and Guidance, page 28. Washington: American Council on Education, 1944.

16 Donald D. Durrell, Improvement of Basic Reading Abilities, page 18. Yonkers:
World Book Company, 1940.

17 Ibid., page 296. Quoted by special permission.

Figure 41. Analysis Sheet of Test 3, Arithmetic Fundamentals, Metropolitan Achievement Tests, for a Fifth-Grade Class in October.

Figure 41 illustrates a useful procedure for analyzing the errors revealed
by a standard test in arithmetic. The procedure is equally applicable to
informal tests. This particular test, Test 3, Arithmetic Fundamentals, of 
the Metropolitan Achievement Tests, was administered to a fifth grade in 
October. The pupils are arranged in descending order according to the 
score on this test. Each error is indicated by X and each omission by 0 as 
far as the pupil attempted problems; the problems beyond the last one 
attempted are indicated by - - - - . The summary at the bottom shows 
how many times each problem in the test was missed and omitted. This 
simple analysis reveals clearly what type of problems caused trouble and 
to whom the trouble was caused. The procedure is really group diagnosis, 
but it may be regarded as the first step in individual diagnosis. It should 
be apparent that classroom teachers who are content merely to obtain the 
total score made by each pupil on a test are really overlooking the greatest
value of the test for instructional purposes. 
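The analysis sheet described here amounts to a pupil-by-problem table with a summary row of errors and omissions. A minimal sketch of that tabulation follows. The five-problem sheet and the pupil labels are hypothetical, and the marking codes are an assumption loosely modeled on the chart ("X" for an error, "O" for an omission, "-" for problems beyond the last one attempted, and "." for a correct answer).

```python
# Marks per problem: "." correct, "X" error, "O" omission, "-" not attempted.
# Hypothetical five-problem sheet; pupils are listed in descending order of score.
sheet = {
    "Pupil A": "....X",
    "Pupil B": "..X.X",
    "Pupil C": ".OX.X",
    "Pupil D": "X.XO-",
}


def summarize(sheet):
    """Return (errors, omissions): per-problem counts across all pupils."""
    n_problems = len(next(iter(sheet.values())))
    errors = [0] * n_problems
    omissions = [0] * n_problems
    for marks in sheet.values():
        for i, mark in enumerate(marks):
            if mark == "X":
                errors[i] += 1
            elif mark == "O":
                omissions[i] += 1
    return errors, omissions


errors, omissions = summarize(sheet)
for name, marks in sheet.items():
    print(f"{name:8s} {' '.join(marks)}")
print(f"{'Errors':8s} {' '.join(map(str, errors))}")
print(f"{'Omitted':8s} {' '.join(map(str, omissions))}")
```

Reading down a column shows which problems troubled the class, and reading across a row shows where one pupil's difficulties lie, which is the double use of the sheet that the text points out.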

Similar error analyses can be made for most subjects, but are especially 
valuable in mathematics, spelling, reading, handwriting, and language. It is 
usually better to make more than one such analysis, however, than to rely 
upon a single sampling, which is almost sure to include some errors that are 
merely chance occurrences rather than habitual. Brueckner and Elwell,¹⁸
for example, found from the study of a test in the multiplication of fractions,
containing in random order four examples of each type, that failure
to work a single example correctly is hardly a safe index and that at least
three problems of each type are required for a valid individual diagnosis.
A later study¹⁹ in subtraction showed that all the problems of a type should
be grouped together on the test. 
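The caution against judging from a single missed example can be expressed as a per-type tally. In the sketch below the minimum of three items per type follows the finding cited in the text, while the cutoff of two misses for calling an error habitual is purely an assumption made for illustration, as are the item numbers and type labels.

```python
from collections import defaultdict

MIN_ITEMS_PER_TYPE = 3  # from the finding cited in the text
MIN_MISSED = 2          # assumption made for this sketch only


def diagnose(item_types, missed_items):
    """item_types maps item number -> type label; missed_items is a set of item numbers."""
    given = defaultdict(int)
    missed = defaultdict(int)
    for item, kind in item_types.items():
        given[kind] += 1
        if item in missed_items:
            missed[kind] += 1

    report = {}
    for kind, n_given in given.items():
        if n_given < MIN_ITEMS_PER_TYPE:
            report[kind] = "too few items to judge"
        elif missed[kind] >= MIN_MISSED:
            report[kind] = "probable habitual error"
        elif missed[kind] == 1:
            report[kind] = "possibly a chance error"
        else:
            report[kind] = "no difficulty shown"
    return report


# Hypothetical test layout and one pupil's missed items.
item_types = {1: "type I", 2: "type I", 3: "type I",
              4: "type II", 5: "type II", 6: "type II",
              7: "type III"}
report = diagnose(item_types, missed_items={2, 4, 5, 7})
print(report)
```

The point of the design is that a type represented by a single item never yields a verdict, however the pupil fared on it.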

It is not sufficient, however, to stop with tabulating the frequencies of 
questions missed on tests or mistakes made in written work. A further 
analysis must be made of the types of errors represented. It will be noted 
that problem 24 in Figure 41 was missed by 23 pupils out of 28. As a basis
for remedial instruction the teacher needs to know what types of incorrect 
solutions were made by her pupils. An examination of the test papers pro- 
vides the answer. Problem 24 follows: 


24. Add 34 
NE 


= 


It is found that 15 of the 23 incorrect solutions were 74, merely a failure
to reduce the fraction to its lowest terms. Five of the 6 errors made by the
best 7 pupils were of this type. Five pupils got as an answer 7$, which represents
two types of errors. Still more serious is the status of the pupil who
got $ for an answer. An interesting type of incorrect solution is represented
by a pupil whose answer was 78. It is apparent that he merely added the
numerators and the denominators without taking the trouble to reduce the
fractions to a common denominator. The other wrong answer was 7$.

18 Leo J. Brueckner and Mary Elwell, “Reliability of Diagnosis of Error in Multiplication
of Fractions,” Journal of Educational Research, 26: 175-185, November, 1932.

19 Leo J. Brueckner and Mabel J. Hawkinson, “The Optimum Order of Arrangement
of Items in a Diagnostic Test,” Elementary School Journal, 34: 351-357, January, 1934.
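Sorting wrong answers by type, as was done for problem 24, can be sketched for any addition of mixed numbers. The addends and pupil answers below are hypothetical and are not those of problem 24; the two error types checked are the ones named in the text, failure to reduce to lowest terms and adding numerators and denominators directly.

```python
from collections import Counter
from fractions import Fraction


def value(mixed):
    """Exact value of a mixed number given as (whole, numerator, denominator)."""
    whole, num, den = mixed
    return whole + Fraction(num, den)


def classify(answer, a, b):
    """Label one pupil's answer to a + b; all three are (whole, num, den) tuples."""
    whole, num, den = answer
    if value(answer) == value(a) + value(b):
        lowest = Fraction(num, den)
        reduced = (lowest.numerator, lowest.denominator) == (num, den) and num < den
        return "correct" if reduced else "not reduced to lowest terms"
    if answer == (a[0] + b[0], a[1] + b[1], a[2] + b[2]):
        return "added numerators and denominators"
    return "other"


# Hypothetical problem: 2 1/6 + 3 1/3, whose reduced sum is 5 1/2.
a, b = (2, 1, 6), (3, 1, 3)
answers = [(5, 3, 6), (5, 3, 6), (5, 1, 2), (5, 2, 9), (5, 1, 3)]

tally = Counter(classify(ans, a, b) for ans in answers)
print(tally)
```

Answers are kept as written (whole number, numerator, denominator) rather than as reduced values, since an unreduced answer and the correct one have the same value and differ only in form.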

A second illustration of the value of error analysis is taken from spelling. 
A few years ago the writer gave a spelling test to a class of high-school 
seniors. The results were disappointing. One of the words missed most often 
was “undoubtedly.” Contrary to expectation, a tabulation of the errors 
revealed the fact that the first two syllables were spelled correctly by all 
pupils. The misspellings were of four forms: “undoubtelly,” “undoubtely,” 
“undoubtaly,” and “undoubtally.” It can be seen that the fundamental 
error is mispronunciation. The pupils were attempting to spell this common 
word as they were accustomed to pronounce it. Hildreth²⁰ reports that confusion
over vowels in the middle and end syllables is a prolific source of
error, and that syllables containing e, a, and o, are especially liable to vague,
indistinct pronunciation. Another investigator²¹ found that emphasis upon
correct pronunciation in reading resulted in a decided improvement in the
spelling of pupils in the fifth and sixth grades. 
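The observation that every pupil spelled the opening of "undoubtedly" correctly can be checked mechanically by finding the longest prefix shared by the correct word and the four misspellings reported in the text. This is only a sketch of the tabulation; the variable names are illustrative.

```python
import os

correct = "undoubtedly"
misspellings = ["undoubtelly", "undoubtely", "undoubtaly", "undoubtally"]

# os.path.commonprefix compares strings character by character.
shared = os.path.commonprefix([correct, *misspellings])
divergent_endings = sorted({word[len(shared):] for word in misspellings})

print("Spelled correctly by all:", shared)
print("Endings written by pupils:", divergent_endings)
print("Correct ending:", correct[len(shared):])
```

All the disagreement falls after "undoubt", in the unstressed final syllables, which fits the explanation that the pupils were spelling the word as they pronounced it.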

One of the greatest values of such error analyses is that they reveal that 
a relatively few types of errors made over and over again are responsible 
for the poor performance of most pupils. In an early study of errors in 
spoken language Charters²² found that 71 per cent of the errors made by
Pittsburgh children fell into only five classes. A study in Madison, Wisconsin,²³
revealed that more than half of the total number of language
errors made from the kindergarten through the sixth grade represented
but four types. In an extensive study Newland²⁴ found that errors in writing
only four letters, a, e, r, and i, accounted for almost half of the illegibilities
made, whether by elementary-school, high-school, or adult groups, and
that only four types of difficulties in letter-formation caused more than 
half of the illegibilities. It cannot fail to be encouraging to teachers and 
pupils alike to find that remedial efforts directed at a relatively few trouble- 
some points may result in great improvement. 
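The finding that a few error types account for most errors can be checked on any tabulation by ranking the types and accumulating their counts. The counts below are hypothetical and are not taken from the studies cited; `types_needed` is an illustrative name.

```python
def types_needed(counts, share=0.5):
    """Smallest number of error types that together cover `share` of all errors."""
    total = sum(counts.values())
    running = 0
    for rank, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)


# Hypothetical tally of errors by type, invented for illustration only.
counts = {"A": 40, "B": 25, "C": 15, "D": 10, "E": 5, "F": 3, "G": 2}

print(types_needed(counts))  # number of types covering at least half the errors
```

Here two of the seven types cover at least half of the errors, so remedial work aimed at those two would address most of the trouble.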

20 Gertrude Hildreth, op. cit., page 492.

21 Marjorie E. Kay, “The Effect of Errors in Pronunciation upon Spelling,” Elementary
English Review, 7: 64-66, [date illegible].

22 Unpublished report made [remainder illegible].

23 [Title partly illegible] Committee Reports. Madison, Wisconsin: Madison Public
Schools, 1932.

24 T. Ernest Newland, “An Analytical Study of the Development of Illegibilities in
Handwriting from the Lower Grades to Adulthood,” Journal of Educational Research,
26: 249-258, December, 1932.

Locating the causes of errors. Even more important, and usually far
more difficult, than knowing where the errors occur is knowing why they
occur. One limitation of test scores in diagnosis is that they reveal the
products of learning rather than the learning process itself. Tyler²⁵ makes a
useful distinction between measurement or appraisal, and interpretation
or inference. In other words, causation is not established directly by the act
of measurement, but must be inferred from the measurement and other
pertinent data. Scates²⁶ puts the situation clearly:


A multitude of test scores are in themselves meaningless. They show facts, but 
they do not show reasons. They neither diagnose nor evaluate. They may be use- 
ful aids, but they leave the principal problem to the teacher’s insight, namely, that 
of determining what is indicated. 


At times, as in some of the examples cited, a reasonably safe inference can 
be made from the nature of the errors themselves. But rarely can a suffi- 
ciently complete explanation be made without considering the child’s past 
history, outside the school as well as inside. It is never safe to infer that a 
child’s poor performance in school is due to mental deficiency or person- 
ality defects, unless a careful study of his educational opportunities has 
been made. Fortunate indeed is the school whose records are sufficiently 
complete to provide the essential data. 

Certain outstanding physicians and surgeons have advocated an en- 
larged concept of diagnosis in modern medicine. Several years ago Sir 
William Osler argued that it was more important to know what kind of 
man had a certain disease than to know what disease the man had. Wilbur 
has made the following statement:²⁷


It is just as important in these days for a young doctor to understand his patient's
personal life, home responsibilities, and community relationships, as it is to be able 
to tell just what organisms are living in his lungs or invading his liver. . . . The 
doctor who has not studied psychology and who cannot acquire a knowledge of it,
if he is to be successful, will have to confine himself to work in the laboratory or 
be a pure technician. 


Hildreth suggests that the following five “areas of investigation”²⁸ are
important in diagnosis:


Mental equipment of the learner. Aptitude for academic schoolwork, learning 
capacity, readiness for learning, habitual modes of response, judgment, reasoning 
ability, insight, memory, association, perception, attention span, ability to see 
relationships, creative ability, intellectual interest, suggestibility, comprehension, 
auto-criticism, habits. 


25 Ralph W. Tyler, “Elements of Diagnosis,” Thirty-Fourth Yearbook of the National 
Society for the Study of Education, page 113. Quoted by permission of the Society. 
Bloomington, Illinois: Public School Publishing Company, 1935. 

26 Douglas E. Scates, “Differences Between Measurement Criteria of Pure Science
and of Classroom Teachers,” Journal of Educational Research, 37: 1-13, September,
1943.

27 Ray Lyman Wilbur, “The March of Medicine,” Science, 87: 201-202, March 4,
1938.

28 Gertrude Hildreth, Learning the Three R’s, pages 547-549. Minneapolis: Educational
Publishers, Inc., 1936.




Language equipment: Command of mother tongue, knowledge of foreign lan- 
guages, language first learned, speech defect, immaturity in speech, in articulation, 
or diction; vocabulary, rapidity or slowness of speech, history of speech develop- 
ment, age of using words and sentences, descriptive powers, written composition. 


Personality, temperament, and dynamic equipment. Self-control, affability, de- 
sirable and undesirable inhibitions, attitudes, friendliness, susceptibility, docility, 
irascibility, drive, perseverance, stability, lability of mood, compliance, responsive- 
ness, restlessness, shyness, tendency toward embarrassment, day dreaming, fears, 
withdrawal from reality, sex interest, morbid curiosity, irrational attitude, man- 
ners, attitude toward failure and toward the school disability, compensations, 
child’s interests, attitude toward school, preferred school subjects, child’s play 
interests, obsessions, fears, worries, ability to get along with other children, social 
qualities, attitude toward brothers and sisters and other members of the family, 
delinquent and anti-social activities, degree of normal adjustment, changes, growth 
and development in all these factors since birth. 


Physical status, sensory and motor equipment, physical conditions. Sensory acuity, 
constitutional defects, physical maturation, physical handicaps and defects, disease 
history, glandular balance, condition of teeth, etiology of illness, posture, accidents 
or unusual physical shocks, nutrition, diet, hygiene, psycho-motor status, muscular 
strength or weakness, handedness, steadiness, coordination, efforts to change 
handedness, facility in sports and games. 


Environment and home history. Economic factors, literacy of parents, number of
sibs, marital status of parents, foreign background, other adults in the home and
their contact with this child, evidences of culture, e.g., books, musical instruments,
labor-saving devices in the home, harmony in home adjustments, attitude of home
toward school, cooperation of home with school, neighborhood environment,
association with other children, child’s opportunity for free time, child’s activities in
free time.


Child’s daily schedule: Rising, eating, sleeping, play, schoolwork at home; 
regularity or irregularity in home program. 

School situation, history and present status. Methods of instruction, especially in 
the work with which the child has difficulty; size of class groups, capability of class 
groups, school marks, textbooks and other materials used, progress of other chil- 
dren, progress in learning from grade to grade, retardation, failure or double 
promotion, attitude of child toward teacher, teacher's usual success with pupils 
of her grade level, teacher’s experience, rapidity with which average child pro- 
gresses, requirements of the course of study, classification system, provision for 
individual assistance, date of first recognition of the child’s disability, former 
diagnostic and remedial work carried on with the child both in school and in 
clinics, survey of all school records that would throw light on the situation, kind 
and extent of supervision, objective test records, analysis of previous training and 
methods of attack in learning, e.g., to write; evidence of readiness for instruction 
before work in skills, teacher's story of the case, attitude toward the child, dis- 
cipline in the classroom, absence from school, tendency toward tardiness, truancy. 
Information the teacher has about modern methods in education and child study, 
progressiveness of the school program, extent to which teacher makes individual
studies and keeps cumulative records of pupil, age of the child on entering school, 
terms retarded, failure in specific subjects, absence. Extent to which each teacher 
is acquainted with the child’s past school history, extent to which the teacher 
knows the facts at the beginning of the school term, teacher’s explanation of the 
cause of difficulty, teacher's recommendations as to what should be done, efforts




the teacher has been making to eradicate the difficulty, extent to which the teacher 
capitalizes the child's interests.


It is, of course, manifestly impossible, as well as usually unnecessary, 
to consider all these facts in any particular case. Satisfactory explanation 
of the less serious cases can often be found in a relatively few factors, 
although rarely, if ever, in just one. The more serious cases will usually be 
found more complex to analyze as well as more difficult to remedy. 

It will frequently be necessary to supplement the data of the existing 
school records. A visit to the pupil’s home is often helpful. A careful obser- 
vation of the pupil at work is another fruitful source of information. Objec- 
tive records of observations made under controlled conditions are particu- 
larly important. Considerable light is often thrown upon the attitudes and 
work habits of unsuccessful pupils by observing them at work and then by 
comparing successful pupils under similar conditions. 

A skillful interview by a tactful teacher will sometimes give a clue to the 
difficulty when other methods fail. In the upper grades and the high school, 
check lists, questionnaires, and other forms of written responses are valu- 
able aids to the personal interview. Having the pupil "think out loud" 
through the solution of a problem in mathematics or science, or give an 
explanation of the procedure used, is often most illuminating. 

Two illustrations make clear the value of the interview as a supplement 
to the written test in locating the sources of difficulty in arithmetic. Buswell
tells of a boy of better than average intelligence whose work in column
addition was both slow and inaccurate. To the interviewer he explained 
that he did not like to add and so wanted to get the worst of it over as soon 
as possible. For this reason he always added the numbers according to size, 
beginning first with the largest numbers and leaving the smallest ones till 
last. But as this technique meant skipping up and down the column, it 
involved great risk of omitting some of the numbers altogether and of add- 
ing others more than once. The story is told of a sixth-grade school-girl 
who had an elaborate but somewhat ineffective “system” for solving rea- 
soning problems. Her explanation was somewhat like this: “Whenever 
there’s lots of numbers, I add, but when there’s only two numbers with lots 
of parts [digits], I subtract. But if there is just two numbers and one is 
littler than the other, I divide when they come out even, and multiply 
when they don’t.” It is most unlikely that any analysis of test papers, or
observation of the pupils at work, would have resulted in a correct inference
as to the real trouble in either of the cases above. 

Teachers often find that an interview with the pupil sheds needed light 
upon difficulties in reading and English. Pressey and Campbell²⁹ report
that one ninth-grade pupil explained capitalizing the word “Pirates” on the 
ground that pirates are real persons just as much as “John Silver” or 


29 Sidney L. Pressey and Pera Campbell, “The Causes of Children’s Errors in Capitalization:
A Psychological Analysis,” English Journal, 22: 197-201, March, 1933.




"Captain Kidd.” Another teacher discovered that a boy had written “a 
quarter to three" in answer to a question on a reading test when the correct 
answer was "twenty-five minutes till three” because everybody knows that 
twenty-five cents make a quarter! 

Brownell³⁰ has shown the possibilities of classifying the mental processes
used by the pupils as revealed by interviews according to levels of maturity
represented. He concludes that a reasonably flexible interview technique
in analyzing learning is “exceedingly valuable if it is sagaciously employed.”
One survey³¹ of the experimental literature relating to the reliability
of the interview arrives at the conclusion that “with well-trained
interviewers working under carefully defined conditions, quantitative in- 
terview ratings representing a complex over-all evaluation can be made as 
reliable as most personality tests, and more reliable than some of them.” 
Nevertheless, good interviewing requires skill as well as time and patience. 

Remedial procedures. The ultimate purpose of diagnosis is to afford 
a basis for effective remedial procedures. When the cause or causes of the 
pupil’s unsatisfactory adjustments have been determined, an intelligent 
program of correction can be planned, and not until then. Whenever the 
same causes appear to operate in several pupils, group measures may be 
satisfactory. Usually, however, remedial programs must be planned for each 
pupil individually. 

A study by Davis³² shows the close relationship between educational
diagnosis and remedial instruction. Two extra periods a week were devoted 
to 275 pupils of poor spelling ability in grades 2B to 6A, inclusive. The 
results showed “marked improvement." Pupils remained in the remedial 
classes until they made perfect scores on the spelling tests of two successive 
Fridays. The average time required was 7.5 hours, and bore little relation- 
ship either to intelligence or grade location. Twenty-four different types 
of difficulties were located, and listed with each difficulty were the most 
successful remedies found by the teachers. The ten most common diffi- 
culties, with their remedies, are shown in Table 34. 

Traxler³³ has prepared some very convenient charts which outline appropriate
diagnostic and remedial procedures for common types of disabilities
in reading, arithmetic, language usage, spelling, and handwriting.
Figure 42 shows the chart for handwriting. Note that a detailed analysis 
of samples of the pupil's writing is suggested, as well as diagnostic charts 
and tests. 


30 William A. Brownell, “Rate, Accuracy and Process in Learning,” Journal of Educational
Psychology, 35: 321-327, September, 1944.

31 Sidney H. Newman, Joseph M. Bobbitt, and Dale C. Cameron, “The Reliability
of the Interview Method in an Officer Candidate Evaluation Program,” American
Psychologist, 1: 103-109, April, 1946.

32 Georgia Davis, “Remedial Work in Spelling,” Elementary School Journal, 27: 615-626,
April, 1927.

33 Arthur E. Traxler, op. cit., pages 34-35.




TABLE 34

DISTRIBUTION OF SPELLING DIFFICULTIES AND SUCCESSFUL REMEDIES (AFTER DAVIS)

Difficulties and Remedies                                              Frequency

 1. Has not mastered the steps in learning to spell a word ............... 88
    a. Teach steps until every child knows them and uses them.
    b. Study each word with the children.

 2. [Difficulty illegible] ............................................... 88
    a. Discover particular letters or combinations of letters that are
       difficult and practice on these letter combinations.
    b. Practice words containing writing difficulties.

 3. Cannot pronounce the words being studied ............................. 78
    a. Go over the words before the children study them so that every
       child will know what he is studying.
    b. Help the child to unlock words for himself.

 4. Has bad attitude toward spelling ..................................... 71
    a. Supervise study closely so that the child will get into the habit of
       studying words correctly without wasting time.
    b. Try to show need for study.
    c. Give study work under time pressure.
    d. Try to appeal to pride.
    e. Try to work up competition with self (that is, of the pupil with
       himself).
    f. Give reward.

 5. Does not associate the sound of the letters or the syllables with the
    spelling of the word ................................................. 49
    a. Teach letter sounds.
    b. Listen to careful pronunciation.
    c. Teach the child to syllabify words.
    d. Say words slowly again and again to hear sounds.

 6. Needs more time than can be devoted to spelling in the regular class . 21
    a. Give more time after school or during the day when other work is
       finished.

 7. Is discouraged because he misspelled so many words in the Monday
    [test] ............................................................... 20
    a. Take a few words at a time.
    b. Study at odd times during the day.
    c. Have the pupil stay longer in the afternoon than the others.

 8. Has speech defect .................................................... 16
    a. Listen to pronunciation.
    b. Look at word carefully.
    c. Teach difficult combinations.

 9. Does not mark paper correctly ........................................ 16
    a. Teach child how to check.
    b. Insist on rechecking.
    c. Always check paper.

10. [Difficulty illegible] ............................................... 10
    a. Study words carefully.
    b. Underline difficult part.
    c. Try to spell by syllables.




An effective method with bright pupils may fail with dull. In fact, no
method is likely to improve materially the academic achievement of the
mentally deficient child. Even with normal or superior children the
substitution of correct habits for incorrect will require time. No sudden
transformation is to be expected. But if only negligible progress results from
extended practice, the remedial program should be revised.


CHART V: HANDWRITING

SUGGESTED DIAGNOSTIC AND REMEDIAL PROCEDURES


1. Slant

   Type of defect: a. Too much slant. b. Writing too straight. c. Lack of
   uniformity.

   Diagnostic procedure: Use diagnostic chart; study different samples of
   writing. Draw lines through letters parallel to slant on different parts of
   page. Compare these lines as to direction. Observe pupil as he writes and
   note details: position, paper, etc.

   Suggested types of remedial treatment: Some instances of poor slant can be
   corrected by changing position of writing arm or manner of grasping pen.
   Change in position of paper will help others. Note that paper should be at
   an angle. Other pupils must learn to turn their hand as they approach end
   of line. Explain to pupils effect of slant on quality.


2. Alignment

   Type of defect: a. Lack of uniformity. b. All letters about the same
   height.

   Diagnostic procedure: Use diagnostic chart; draw horizontal lines through
   writing even with top and bottom of some of the letters.

   Suggested types of remedial treatment: Explain defect to pupil. Lack of
   uniformity of alignment results partly from motor inco-ordination and will
   probably be corrected as co-ordination of writing movements improve
   through practice.


3. Quality of line

   Type of defect: a. Writing too heavy. b. Writing too light. c. Line wavy
   and uncertain.

   Diagnostic procedure: Use diagnostic chart; note type and size of pen and
   manner of holding it; note speed of writing.

   Suggested types of remedial treatment: Make sure that pupil has proper
   writing materials; see that he does not use his writing arm to support his
   body. If line is thin and wavering give drills to speed up movement and
   improve co-ordination.


4. Formation of letters

   Type of defect: a. Poor general form. b. Lack of smoothness. c. Parts
   omitted. d. Parts added. e. Letters not closed.

   Diagnostic procedure: Use diagnostic chart; if desired, letter form may be
   analyzed in detail with Pressey chart. Study general form and habits of
   forming each letter. Often faults in letter form are related to only a few
   letters.

   Suggested types of remedial treatment: Make some use of movement drills to
   improve smoothness. Practice especially on movements common to several
   letters. Study details of letter form with pupils and show them where they
   need to improve. Have pupils practice individually on the letters which
   diagnosis has shown to be poorly formed.


5. Spacing of words

   Type of defect: a. Too wide. b. Too narrow. c. Not uniform.

6. Spacing of letters

   Type of defect: a. Too wide. b. Too narrow. c. Not uniform.

   Diagnostic procedure (5 and 6): Use diagnostic chart; study various
   samples; note whether wide spacing or crowding occurs on different part of
   page. Observe pupils while writing for evidence of too much lateral
   movement.

   Suggested types of remedial treatment (5 and 6): Explain fault to pupil.
   Have him pay especial attention to spacing while writing samples to be
   inspected by teacher. Movement exercises are of some value in improving
   spacing.


7. Size of writing

   Type of defect: a. Too large. b. Too small. c. Lack of uniformity.

   Diagnostic procedure: Study different samples and compare with those of
   other pupils in the same grade. Considerable variability is allowable among
   individuals and especially between grades. Young pupils tend to write
   large. Note freedom of movement. Try to discover cases of lack of
   uniformity.

   Suggested types of remedial treatment: Writing that is too small may result
   from a cramped finger movement. Give movement exercises to relax pupil and
   bring about some arm movement. If writing is too large pupil can sometimes
   correct it through conscious effort if his attention is called to it. In
   young pupils, improvement may have to await the process of maturation.


8. Writing not neat

   Type of defect: a. Blotches. b. Words crossed out and rewritten.

   Diagnostic procedure: Examine samples of writing, especially those prepared
   in daily work. With respect to blotches see if writing materials are
   defective.

   Suggested types of remedial treatment: See that pupil has proper writing
   materials and that they are kept in working order. Explain effect of lack
   of neatness on all school work. Make daily work in other subjects the gauge
   of neatness.


9. Speed

   Type of defect: a. Writing too slow. b. Writing too fast.

   Diagnostic procedure: Speed of writing affects the quality, but aside from
   this fact it is important in that some pupils write so slowly and
   laboriously that they have difficulty in preparing assignments on time.
   Give a test of speed of writing and compare number of letters per minute
   with grade norms.

   Suggested types of remedial treatment: If writing is too fast show pupil
   its effect on letter form and have him write samples under timed
   conditions. Some pupils go to the other extreme and write so slowly that
   they practically draw the letters. Give movement exercises while counting
   rapidly to break down habits of slow movement. Insist that pupils speed up
   writing regardless of their letter forms. Their writing will probably
   deteriorate for a time, but when old habits are broken down teacher and
   pupil can build new ones.


Figure 42. Traxler Chart of Suggested Diagnostic and Remedial Procedures in
Handwriting.

Preventive diagnosis. In the long run, the greatest value of a diagnostic 
and remedial program is the discovery of preventable factors within the 
control of the school which lead to maladjustment. Frequently, modifica- 
tions in school organization, curriculum, instructional materials, and teach- 
ing methods are suggested by an analysis of what is happening to the pupils 
under the existing program. Manifestly, factors which have produced learn- 
ing difficulties in the past are likely to do so in the future. It is always 
better, and generally easier, to prevent errors than to correct them. It will 
often be found that a program of studies which provides wider differentia- 
tion in method and content to suit pupils of varying abilities and interests 
is the way out of many difficulties. The systematic use of readiness tests 
of various types to determine when the pupil is sufficiently mature, phys- 
ically, mentally, and socially, to begin the regular work of the first grade, 
and a judicious use of aptitude tests to establish the pupil’s fitness for the 
more formal and abstract subjects, such as arithmetic, algebra, and foreign 
language, will prevent much needless failure. Terman³⁴ says: “Perhaps the
most important conclusion to be drawn from the extensive researches here 
reported is that disability of any degree in any of the basic school subjects 
is wholly preventable.” Prevention is the highest level of diagnosis, its ultimate 
goal. 


SELECTED REFERENCES FOR FURTHER READING 


Betts, Emmett Albert, Foundations of Reading Instruction with Emphasis on
Differential Guidance. New York: American Book Company, 1946. 758 pages. 

Blair, Glenn Myers, Diagnostic and Remedial Teaching in Secondary Schools. 
New York: The Macmillan Company, 1946. 422 pages. 

Boyd, Gertrude, and Schwiering, O. C., “A Survey of Child Guidance and Remedial
Reading Practices,” Journal of Educational Research, 43: 494-506, March, 1950. 

Cook, Walter W., “The Functions of Measurement in the Facilitation of Learning,”
Chapter 1 in E. F. Lindquist (Editor), Educational Measurement. Washington, 
D. C.: American Council on Education, 1951. 

Cronbach, Lee J., Educational Psychology. New York: Harcourt, Brace and Com- 
pany, 1954. Chapters 6 and 7, “Assessing Readiness: I. Personality and Motiva- 
tion" and “Assessing Readiness: II. Abilities.” 

Dressel, Paul L., and Mann, William A., “Appraisal of the Individual,” Review of
Educational Research, 21: 115-181, April, 1951.

Hildreth, Gertrude, Learning the Three R’s (Second Edition). Philadelphia: Educa- 
tional Publishers, Inc., 1947. 897 pages. 

McCullough, Constance M., Strang, Ruth M., and Traxler, Arthur E., Problems
in the Improvement of Reading. New York: McGraw-Hill Book Company, 1946.
406 pages.

34 Lewis M. Terman, “Foreword” to Grace M. Fernald’s Remedial Teaching in Basic
School Subjects, page ix, copyright 1943. Reprinted by permission of the publishers,
McGraw-Hill Book Company, Inc.




Simpson, Robert G., Fundamentals of Educational Psychology. Philadelphia: J. B.
Lippincott Company, 1949. Chapter 13, “Analyzing the Learner’s Difficulties.”

Stauffer, Russell G., “Certain Basic Concepts in Remedial Reading," Elementary 
School Journal, 51: 334-342, February, 1951. 

Stuit, Dewey B., “Counseling Methods: Diagnostics," Annual Review of Psychology, 
2: 305-316, 1951. 

Tyler, Ralph W., “The Functions of Measurement in Improving Instruction,” 
Chapter 2 in E. F. Lindquist (Editor), Educational Measurement. Washington, 
D. C.: American Council on Education, 1951. 


13 


Classification and Promotion 


A. The Nature and Educational Significance of 
Human Variability 


The problem of human variability. The existence of variability is 
one of the best established facts about human beings. Obvious differences 
in height, weight, strength, and good looks could hardly escape the notice 
of the most casual observer. The greatest seers and wise men of all ages 
have recognized also the less obvious but more important differences in 
ability, interests, and needs. One of the familiar parables of Jesus, for
example, is that of the talents.¹

It would be difficult today to find a fuller recognition of the educational 
significance of individual differences than appears in the writings of those 
two apostles of human liberty, Jean Jacques Rousseau and Thomas Jeffer- 
son. Rousseau asserted that “it would be a great mistake to bestow it 
[instruction] on all children indiscriminately and without regard to their 
individual differences.”² Jefferson wrote of a proposed educational measure:
“The general objects of this law are to provide an education adapted
to the years, capacity and the condition of every one, and directed to his
freedom and happiness.”³ It is apparent, therefore, that when the author
of the Declaration of Independence penned the famous line, “All men are
created equal,” he had in mind equality before the law, and that he recognized
fully the duty of the state through education to provide equality of
opportunity. A prominent American educator⁴ argues that “our concepts

1 “And unto one he gave five talents, to another two, and to another one; to every
man according to his several ability.” Matthew 25:15. 


2 Jean Jacques Rousseau, The New Heloise, Part V, Letter 3.
3 Thomas Jefferson, Notes on Virginia, pages 250-252.
4 I. Newton Edwards, “We Need New Purposes in Education,” Phi Delta Kappan,
28: 16, September, 1946.




of freedom and equality are outmoded" and cannot both be realized, since 
they "are in fact mortal enemies." 

It is surprising, therefore, to find that the problem of individual differ- 
ences was not seriously treated in psychology before the time of Galton 
in the latter half of the nineteenth century, a neglect which has been characterized
as perhaps the “most extraordinary blind-spot in previous psychology.”⁵

Otto⁶ estimates that during the last twenty years more time and effort
in educational research have been devoted to the study of individual dif- 
ferences than to any other single topic; he is greatly impressed with the 
extensive literature available. Yet in 1925 a competent school psychologist 
stated that the “schools heretofore have to a large extent ignored these 
differences.”⁷ Five years later a national survey revealed that “provisions
for individual differences, in general, are innovations in the secondary
schools.”⁸ In 1936 a survey of 300 courses of study showed that only about
one in ten “contain any suggestions for adapting instruction to individuals.”⁹
Eight years later an educational psychologist characterized as
largely “lip service” the attention educators give to individual differences.
Davis¹⁰ says:

Despite its philosophy of individualization, the school, in practice, fosters a 
program of regimentation and standardization. 


In the meantime, better enforcement of the compulsory education laws 
and the rapid increase of secondary-school enrollments have served but to 
intensify the problem, the nature of which was more accurately revealed 
by scientific measurement.¹¹

Group differences. Scientific research, on the whole, has shown that 
differences between groups are not so great as they are commonly assumed 


5 Gardner Murphy, Historical Introduction to Modern Psychology (Revised Edition),
page 117. New York: Harcourt, Brace & Company, Inc., 1949.

6 Henry J. Otto, Elementary School Organization and Administration (Second Edition),
page 160. New York: D. Appleton-Century Company, 1944.

7 A. A. Sutherland, “Factors Causing Maladjustment of Schools to Individuals,”
Twenty-Fourth Yearbook of the National Society for the Study of Education, Part II,
pages 20-30. Bloomington, Illinois: Public School Publishing Company, 1925.

8 Roy O. Billett, Provisions for Individual Differences, Marking and Promotion, National
Survey of Secondary Education, Monograph No. 13, page 8. Washington, D. C.:
United States Office of Education, 1932.

9 Henry Harap, “Differentiation of Curriculum Practices and Instruction in Elementary
Schools,” Thirty-Fifth Yearbook of the National Society for the Study of Education,
Part I, page 162. Bloomington, Illinois: Public School Publishing Company, 1936.

10 Robert A. Davis, “Experimenting in Education,” Educational Administration and
Supervision, 30: 1-16, January, 1944.

11 For an excellent summary, see A. R. Gilliland and E. L. Clark, Psychology
of Individual Differences, 535 pages. New York: Prentice-Hall, Inc., 1939. A standard
work in this area is Anne Anastasi and John P. Foley, Jr., Differential Psychology:
Individual and Group Differences in Behavior (Revised Edition), 894 pages. New York:
The Macmillan Company, 1949.




to be. There is little basis for the widespread illusion that the group of which 
one happens to be a member is superior, while all others are inferior. The 
intellectual differences between the sexes, for example, are in general slight. 
Furthermore, all levels of mental ability are found in all economic, occupational,
and social groups, although not in the same proportions. Even the
differences between races have been grossly exaggerated, and such differences
as appear probably reflect cultural rather than innate intellectual
variations. It is manifestly impossible to make adequate provision for individual
differences by classifying pupils for instructional purposes according
to the social, economic, occupational, racial, or other similar group from
which they come. Fortunately, perhaps, for democracy the problem is not 
so simple as that. 

Almost without exception the average differences between groups are 
less significant than the differences within any single group. An important 
example of this is the enormous overlapping among school grades. Although 
the average difference in intelligence and in achievement between succes- 
sive school grades rarely exceeds one year, the difference within any grade 
is likely to be at least four or five years. As a matter of fact, Baker¹² points


Figure 43. General Quality of 200 Secondary Schools as Judged by Field Committees.
Each ● represents an actual school. Each ○ represents a school in theoretical
distribution. (From Education of Secondary Schools, General Report, page 110.)


out that the achievement of the more capable halves, or the less capable 
halves, of two adjacent grades is usually much more alike than that of the 
two halves of the same grade. On a test of general academic knowledge, 
the Pennsylvania study showed that about 10 per cent of the high-school 
seniors exceeded the median of the college seniors, while nearly 10 per cent 


12 Thirty-Fifth Yearbook, op. cit., pages 187, 145. 




of the college seniors fell below the median of the high-school seniors. 
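
The overlap described here is easily computed once the scores of two groups are at hand. The following is a minimal sketch under stated assumptions: the scores are invented for illustration and are not the Pennsylvania data, and the helper names share_above and share_below are hypothetical. It shows only how one might find the proportion of a lower group exceeding the median of a higher group, and the reverse.

```python
from statistics import median


def share_above(scores, cutoff):
    """Fraction of scores strictly above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)


def share_below(scores, cutoff):
    """Fraction of scores strictly below the cutoff."""
    return sum(s < cutoff for s in scores) / len(scores)


# Hypothetical test scores (invented; not taken from any study cited in the text).
high_school_seniors = [30, 35, 40, 42, 45, 48, 50, 52, 55, 75]
college_seniors = [38, 58, 60, 62, 64, 66, 68, 70, 72, 80]

# Proportion of the lower group that exceeds the median of the higher group.
hs_above_college_median = share_above(high_school_seniors, median(college_seniors))

# Proportion of the higher group that falls below the median of the lower group.
college_below_hs_median = share_below(college_seniors, median(high_school_seniors))

print(hs_above_college_median, college_below_hs_median)
```

The median is used as the point of comparison because the passage itself states the overlap in terms of medians; other reference points, such as quartiles, could be substituted in the same way.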

It must not be thought, however, that one group is just like every other.
As a matter of fact, certain types of groups differ from each other very
much as the individuals within any one group differ from each other. For
example, Figure 43 shows the distribution of the ratings of 200 high schools,
which closely approximates the normal curve. It is quite likely that all the 
schools in a single state would be similarly distributed on practically every 
characteristic. 


Figure 44. Distribution of Mean Scores of Seniors in Forty-Nine Colleges in
Pennsylvania on a Test of General Academic Knowledge. Horizontal axis: mean score,
marked from 320 to 920. (Data from The Student and His Knowledge, page 78.)


Figure 44, although somewhat asymmetrical, shows that, on the basis of 
the mean achievement of their seniors, 49 colleges in Pennsylvania have 
the wide range and the heavy concentration near the center that charac-
terize normal curves. In other words, there are differences among institu-
tions just as there are among individuals. It is this fact that makes the
traditional classification of schools for accrediting purposes such a baffling 
problem, and that has been responsible for the trend toward evaluating 
each school in relation to its own objectives and program rather than in 
relation to other schools. It is now being recognized that it is just these 
differences that give individuality and distinction to institutions. 

13 William S. Learned and Ben D. Wood, The Student and His Knowledge, page 21.
New York: Carnegie Foundation for the Advancement of Teaching, 1938. For a later
illustration from the results of the Army General Classification Test (AGCT), see:
Walter V. Bingham, “Inequalities in Adult Capacity—from Military Data,” Science,
104: 147-152, August 16, 1946.


CLASSIFICATION AND PROMOTION 351 


Individual differences. In contrast with the differences between groups, 
which have frequently been overestimated, the differences within the group 
have usually been underestimated. While a vague notion of individual dif- 
ferences has long been in existence, no adequate knowledge of the nature 
and extent of these differences was possible before the appearance of sci-
entific measurement. Such profound thinkers as Plato, for example, be-
lieved that all persons fell into a few rather distinct groups. In fact, the
idea that many human abilities are distributed on a continuum, with a 
concentration near the middle, is a modern conception. 


Figure 45. Distributions of Composite IQ's on Forms L and M of the Revised
Stanford-Binet Intelligence Scales for a Standardization Group of 2,904 Individuals of
CA's 2 to 18 years. Horizontal axis: intelligence quotient, in ten-point intervals from
35-44 to 165-174. (From Terman and Merrill, Measuring Intelligence, page 37.)


Figure 45, according to Terman and Merrill, “probably gives the clearest 
picture available of the intellectual differences which obtain among 
American-born white children of the ages in question.”¹⁴ Figure 21 on
page 260 shows a similar distribution for ninth-grade pupils. Three char- 
acteristics of the so-called “normal curve" should be noted: (1) the wide 
range from lowest to highest scores, (2) the continuous distribution—no 
breaks, and (3) the distinct tendency to pile up near the center. Many dis-
tributions of test scores have these same characteristics. Skewed curves
differ from symmetrical curves in that the heavy concentration is not
exactly at the center. In a normal distribution approximately two thirds
of the pupils lie within one standard deviation of the mean. After
a survey of the experimental evidence, Hull says:¹⁵


14 Lewis M. Terman and Maud A. Merrill, Measuring Intelligence, page 37. Boston:
Houghton Mifflin Company, 1937.

15 Clark L. Hull, Aptitude Testing, page 36. Yonkers, New York: World Book Company,
1928.




We shall probably not be in great error if we conclude that among individuals 
ordinarily regarded as normal, in the average vocation the most gifted will be between 
three and four times as capable as the poorest.


Important as is the wide range of ability between the two extremes, the 
importance to education of the continuous distribution is equally great. 
On no trait do individuals naturally fall into a few distinct groups, such as 
"inferior," “average,” and "superior," or “dull,” “normal,” and “bright.” 
Such so-called “types” are purely arbitrary. It would be possible to make 
an equally good case for any other number of classes. “In a literal sense, 
everyone is exceptional.”¹⁶ There are similar differences in nonintellectual
traits. 
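The earlier statement that about two thirds of the cases in a normal distribution lie within one standard deviation of the mean can be checked numerically. The following is a minimal sketch, assuming an exactly normal distribution; the function name is chosen only for illustration.

```python
import math


def share_within(k: float) -> float:
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1.0, 2.0, 3.0):
    print(f"within {k:.0f} SD of the mean: {share_within(k):.3f}")
```

The value for one standard deviation is a little over 68 per cent, which is the "approximately two thirds" of the text. Actual test-score distributions only approximate the normal form, so observed shares will differ somewhat.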

Trait variability. Not only are there differences among groups, and 
differences among the individuals of any one group, but there are also 
important differences among the traits making up any particular indi-
vidual. Hull made a careful study of these differences and came to the
conclusion that “the distribution of talent within an individual follows the
normal law much as do the distributions of individual differences.”¹⁷ Not
only did he observe a “distinct tendency to approach the characteristic 
shape of the normal probability curve,” but he also found evidence that 
“the average individual's best vocational potentiality must be between two 
and one-half and three times as good as his worst.”¹⁸ The importance of
these trait differences in an individual for educational and vocational guid- 
ance can hardly be overemphasized. It is also apparent that satisfactory 
ability grouping in one trait may be wholly unsatisfactory in other traits. 

The educational problem is further complicated by the fact that, while 
intercorrelations of these traits are usually positive, the correlations are far 
from perfect. This means that when an attempt is made to secure a group 
homogeneous in one factor, it is still heterogeneous with respect to other 
factors. It is apparent, therefore, that it is impossible to make groups truly 
homogeneous for instructional purposes, even if it were desirable to do so.
The best that can be done is to reduce the amount of heterogeneity. Oppo- 
nents of ability grouping have made much of this point, apparently quite 
oblivious to the fact that ipso facto they are attacking a straw man; for, 
manifestly, one need shed no tears over the dangers of an educational sit- 
uation which one’s own data prove to be a physical impossibility. 
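How much heterogeneity remains in a second trait after grouping on a first can be estimated from the correlation between them. The sketch below is illustrative and rests on assumptions that are not in the text: the two traits are linearly related with constant scatter about the regression line, and the values of r and k are hypothetical.

```python
import math


def remaining_spread(r: float, k: float) -> float:
    """SD of a second trait inside a group, as a fraction of its SD in the whole grade.

    r: correlation between the grouping trait and the second trait.
    k: variance of the grouping trait inside the group divided by its
       variance in the whole grade (0 = perfectly homogeneous group).
    Assumes a linear relation with constant residual variance, so that
    Var(second trait in group) = 1 - r**2 * (1 - k) in standard units.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return math.sqrt(1.0 - r * r * (1.0 - k))


# Hypothetical case: grouping cuts the variance of the grouping trait to one quarter.
for r in (0.3, 0.5, 0.7):
    print(f"r = {r:.1f}: {remaining_spread(r, 0.25):.2f} of the original SD remains")
```

With correlations of the size usually found between school abilities, most of the original spread in the second trait survives the grouping, which is why the best that can be done is to reduce, not remove, heterogeneity.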

The concept of the versatile individual who is equally gifted in a con- 
siderable number of directions is largely a fiction and as an educational 
ideal is capable of doing much harm. It has been ridiculed as follows:¹⁹


16 Edmund S. Conklin and Frank S. Freeman, Introductory Psychology for Students of
Education, page 515. New York: Henry Holt & Company, 1939.

17 Clark L. Hull, op. cit., page 46.

18 Ibid., pages 46, 49.

19 Amos E. Dolbear, “Antediluvian Education,” Journal of Education, 68: 424, 1908.




In antediluvian times, while the animal kingdom was being differentiated into 
swimmers, climbers, runners, and fliers, there was a school for the development of
the animals. 

The theory of the school was that the best animals should be able to do one thing 
as well as another. 

If an animal had short legs and good wings, attention should be devoted to 
running, so as to even up the qualities as far as possible. 

So the duck was kept waddling instead of swimming. The pelican was kept 
wagging his short wings in the attempt to fly. The eagle was made to run, and 
allowed to fly only for recreation. 

All this in the name of education. Nature was not to be trusted, for individuals 
should be symmetrically developed and similar, for their own welfare as well as 
for the welfare of the community. 

The animals that would not submit to such training, but persisted in developing 
the best gifts they had, were dishonored and humiliated in many ways. They were 
stigmatized as narrow-minded and specialists, and special difficulties were placed 
in their way when they attempted to ignore the theory of education recognized 
in the school. 

No one was allowed to graduate from the school unless he could climb, swim, 
run, and fly at certain prescribed rates; so it happened that the time wasted by 
the duck in the attempt to run had so hindered him from swimming that his 
swimming muscles had atrophied, and so he was hardly able to swim at all; and 
in addition he had been scolded, punished, and ill-treated in many ways so as to 
make his life a burden. He left school humiliated, and the ornithorhynchus could 
beat him both running and swimming. Indeed, the latter was awarded a prize in 
two departments. 

The eagle could make no headway in climbing to the top of a tree, and although 
he showed he could get there just the same, the performance was counted a de- 
merit, since it had not been done in the prescribed way. 

An abnormal eel with large pectoral fins proved he could run, swim, climb trees, 
and fly a little. He was made valedictorian. 

Educational provisions for individual differences. Attention has 
already been called to the fact that few schools are making adequate pro- 
visions for the individual differences existing in their pupils. No point stood 
out more prominently in Billett’s study²⁰ than this. Table 35 summarizes
the situation for secondary schools in 1930. Billett reduces these provisions
to seven categories: (1) homogeneous grouping, (2) special classes, (3) plans
characterized by the unit assignment, (4) scientific study of problem cases, 
(5) variation in pupil load, (6) out-of-school projects and studies, and 
(7) advisory or guidance programs. Of these the first three “have been 
found to be core elements in a typically successful program to provide for 
individual differences.”²¹ But it will be noted from the last column that
the most successful provision, in the opinion of those using it, was homo- 
geneous grouping, which had a ratio of 26 per cent. In other words, hardly 
more than one principal in four or five using any of these plans has a con- 
siderable degree of confidence in them. 


20 Roy O. Billett, op. cit., pages 8-11.
21 Ibid., page 11.




TABLE 35

FREQUENCIES WITH WHICH VARIOUS PROVISIONS FOR INDIVIDUAL DIFFERENCES
WERE REPORTED IN USE, OR IN USE WITH UNUSUAL SUCCESS,
BY 8,594 SECONDARY SCHOOLS IN 1930 (AFTER BILLETT)

                                                 | Provision in Use  | In Use with       |
                                                 |                   | Estimated Unusual |
                                                 |                   | Success           |
Nature of Provision                              | Number | Per Cent | Number | Per Cent | Ratio

 1. Variations in number of subjects a pupil
    may carry                                    |  6,428 |       75 |    795 |        9 |   .12
 2. Special coaching of slow pupils              |  5,099 |       59 |    781 |        9 |   .15
 3. Problem method                               |  4,216 |       49 |    444 |        5 |   .10
 4. Differentiated assignments                   |  4,047 |       47 |    788 |        9 |   .20
 5. Advisory program for pupil guidance          |  3,604 |       42 |    540 |        6 |   .15
 6. Out-of-school projects or studies            |  3,451 |       40 |    439 |        5 |   .13
 7. Homogeneous or ability grouping              |  2,740 |       32 |    721 |        8 |   .26
 8. Special classes for pupils who have failed   |  2,612 |       30 |    350 |        4 |   .13
 9. Laboratory plan of instruction               |  2,611 |       30 |    323 |        4 |   .12
10. Long-unit assignments                        |      … |       27 |    349 |        4 |     …
11. Project curriculum                           |      … |       27 |    365 |        4 |   .16
12. Contract plan                                |      … |       27 |    465 |        5 |   .20
13. Individual instruction                       |      … |       25 |    309 |        4 |   .14
14. Vocational guidance through exploratory
    courses                                      |  1,911 |       22 |    186 |        2 |   .10
15. Educational guidance through exploratory
    courses                                      |  1,900 |       22 |    193 |        2 |   .10
16. Scientific study of problem cases            |  1,343 |       16 |    146 |        2 |   .11
17. Psychological studies                        |  1,077 |       12 |     70 |        1 |   .06
18. Opportunity rooms for slow pupils            |    946 |       11 |    172 |        2 |   .18
19. [name illegible in source]                   |    737 |        9 |    175 |        2 |   .24
20. Special coaching to enable capable pupils
    to "skip" a grade or half grade              |    726 |        8 |    114 |        1 |   .16
21. Promotions more frequent than each semester  |    686 |        8 |    103 |        1 |   .15
22. Remedial classes or rooms                    |    593 |        7 |     90 |        1 |   .15
23. Adjustment classes or rooms                  |    544 |        6 |     55 |        1 |   .10
24. Modified Dalton plan                         |    486 |        6 |     52 |        1 |   .11
25. Opportunity rooms for gifted pupils          |    322 |        4 |     69 |        1 |   .21
26. Restoration classes                          |    191 |        2 |     24 |        0 |   .13
27. [name illegible in source]                   |    162 |        2 |     15 |        0 |   .09
28. Winnetka technique                           |    119 |        1 |     14 |        0 |   .12
29. Other techniques                             |    101 |        1 |      … |        … |     …

Ratio: number of provisions in use with estimated unusual success to number in use.
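The ratio in the last column of Table 35 is simply the number of schools reporting unusual success with a provision divided by the number reporting the provision in use. A brief sketch of the arithmetic follows, using three rows of the table; the variable and function names are invented for the example.

```python
# (number of schools using the provision, number reporting unusual success)
rows = {
    "Variations in number of subjects a pupil may carry": (6428, 795),
    "Homogeneous or ability grouping": (2740, 721),
    "Restoration classes": (191, 24),
}


def success_ratio(in_use: int, with_success: int) -> float:
    """Share of the schools using a provision that report unusual success with it."""
    return with_success / in_use


for name, (in_use, with_success) in rows.items():
    print(f"{name}: {success_ratio(in_use, with_success):.2f}")
```

Even for homogeneous grouping, the provision with the highest ratio, only about one user in four reported unusual success.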


Table 36, based on returns from 48 large and 58 small cities, shows trends
in the elementary school.²² Increasing attempts to introduce more flexible


22 V. у. Caldwell, “Some Facts Regarding Elementary School Trends,” School and
Society, 49: 285-288, March 4, 1939. 




TABLE 36

TRENDS TOWARD GREATER PROVISIONS FOR INDIVIDUAL DIFFERENCES IN
ELEMENTARY SCHOOLS (AFTER CALDWELL)

                                                           | Large Cities      | Small Cities
Provision                                                  | Number | Per Cent | Number | Per Cent

A. Daily Schedule:
   1. Longer period                                        |     23 |       50 |     32 |       55
   2. Flexible program                                     |     39 |       85 |     40 |       85
   3. Subject-matter headings eliminated                   |     17 |       37 |     22 |       38
   4. Skills, content, creative activities                 |     35 |       76 |     38 |       66
B. Curriculum Content:
   1. Child experiences as learning basis                  |     38 |       83 |     43 |       74
   2. Elimination of specific subjects                     |      9 |       20 |     15 |       26
   3. More freedom for teacher in interpreting course
      of study                                             |     39 |       85 |     50 |       86
   4. Emphasis on habits, not fact-learning                |      … |        … |      … |        …
   5. Elimination of drill periods                         |      7 |       15 |     13 |       22
   6. Relating learning materials to maturation of child   |     34 |       74 |     37 |       64
   7. Relating learning material to immediate need and
      mental capacity                                      |     39 |       85 |     38 |       66
   8. Experience used to develop number concepts           |     32 |       70 |     31 |       53
   9. Delay in formal presentation of abstract arithmetic  |     27 |       59 |     37 |       64
  10. Elimination of health as a subject                   |     25 |       54 |     33 |       57
  11. Provision for hobby development                      |     39 |       85 |     44 |       76
  12. Vacation activity program development                |     24 |       52 |     28 |       48
C. Physical Environment:
   1. Comfortable, adjustable school furniture             |     32 |       70 |     42 |       72
   2. Automatic lighting equipment                         |     11 |       24 |     13 |       22
   3. Automatic heating control                            |     30 |       65 |     30 |       52
   4. Materials used for sight conservation                |     21 |       45 |     19 |       33
   5. Provision for lunch room                             |     30 |       65 |     29 |       50
   6. Provision for rest facilities                        |     28 |       61 |     26 |       45
   7. Isolation of sick children                           |     24 |       52 |     29 |       50
   8. More floor space per child                           |     17 |       37 |     15 |       26
   9. Provision for safe play apparatus                    |     25 |       54 |     33 |       57
  10. Provision for ample playgrounds                      |     37 |       80 |     41 |       71
  11. Provision for play space in bad weather              |     18 |       39 |     25 |       43
D. Materials:
   1. Basal texts eliminated (skills, content)             |     10 |       22 |     16 |       28
   2. Wide reading material, various levels                |     40 |       87 |     55 |       95
   3. Elimination of work books, etc.                      |     12 |       21 |     25 |       43
   4. Variety of material for creative work                |     40 |       87 |     50 |       86
E. Classification:
   1. Provision for pre-school clinics                     |     32 |       70 |     38 |       66
   2. Use of reading-readiness tests                       |     38 |       83 |     37 |       64
   3. Delay in beginning reading program                   |     29 |       63 |     29 |       50
   4. Groupings by social age, rather than by
      intelligence or achievement                          |     17 |       37 |     15 |       26
   5. Use of no failure program                            |     12 |       26 |     16 |       28
   6. Reduction in pupils retained                         |     35 |       76 |     41 |       71
   7. Use of tests for guidance, not promotion             |     37 |       80 |     46 |       79
   8. Reduction in number of pupils per teacher            |     19 |       41 |     27 |       47




educational programs are shown. It is also apparent that much yet remains 
to be done. But it is encouraging to note that somewhat more than a third 
of these schools were using reading-readiness tests and other tests for diag-
nostic and guidance purposes rather than merely as a basis for promotion.


B. The Activity Movement 


In recent years no program of instruction has received more attention 
among educators than the activity movement, usually a prominent feature 
of the so-called “progressive schools." Yet educational historians assure 
us that the principle that man learns by doing is “as old as man's earliest 
education.”²³ In fact, its roots lie further back than the beginning of for-
mal education in schools. Its advocates go even further and assure us that
it is grounded in the fundamental nature of the learner himself. However, 
there are such wide divergencies among its champions, both in theory and 
in practice, that it may be said that the activity movement not only 
recognizes individual differences to an astonishing degree, but also actually 
demonstrates such differences. The essential features of this educational 
program may be briefly, and somewhat inadequately, described as follows:

1. Education results from the child's own purposeful activity with proc- 
esses considered personally vital to him. An activity, according to Kil- 
patrick, is “a unitary sample of actual child living as nearly complete and 
natural as school conditions will permit.”²⁴ At every stage the organism
reacts as a whole, and the physical, intellectual, and emotional experiences 
are interrelated. 

2. Learning is inherent within the life process itself. It results naturally 
from the learner’s self-directed purposeful activity. Teaching, like learning, 
is individual in character, arising from a felt need. The teacher is only a 
guide, and all subject matter is merely a tool. The activity program clearly 
places upon the shoulders of the classroom teacher the difficult problem 
of adjustment to individual differences. 

3. Interest is at all times the motivating factor in the learning process.
Although all teaching procedures recognize the value of interest, the activ- 
ity movement emphasizes more than any other program the importance of 
inner drives and interests of the individual pupil, as opposed to extraneous 
motivation of any kind. 

4. The development of the learner’s personality, rather than the accumu- 
lation of facts and skills, is the objective of all learning. The personality 
of each individual will develop in accordance with his own abilities, 
interests, and personal experiences. 

5. The evaluation of this relatively intangible personal development in- 


23 Thomas Woody, “Historical Sketch of Activism,” in Thirty-Third Yearbook of the
National Society for the Study of Education, Part II, pages 9-43. Bloomington, Illinois:
Public School Publishing Company, 1934. Quoted by permission of the Society.

24 Thirty-Third Yearbook, op. cit., page 63. Quoted by permission of the Society.




volves a fairly long time-span and, therefore, lends itself more to qualita- 
tive than to quantitative judgment. In the evaluation process the pupil 
himself is an active participant. According to Dewey, “the more mature 
and experienced the teacher, the less will he or she be dependent upon 
tangible, directly applicable, external tests, and will use them, not as final,
but as guides to judgment of the direction in which development is taking
place.”²⁵

It should not be overlooked, however, that regardless of the relative 
emphasis, such activities as reading and arithmetic are always going to be 
important, and there appears to be no good reason to rely entirely upon
subjective impressions when objective measures are available. The mere 
fact that adequate measures of the less tangible outcomes are not yet avail- 
able is no justification for neglecting the measurement of the tangibles, 
the tools for which do exist. Furthermore, the absence of suitable tools no 
more removes the need for evaluation than a lack of food relieves the pangs 
of hunger. Indeed, the need is probably greater, as Gates suggests:²⁶


Any scheme of education that emphasizes the nature and needs of the individual 
child, as most progressive programs do, has far greater need of measurements than 
conventional programs designed primarily to impart information and skill to pupils 
en masse. 


C. Homogeneous or Ability Groups 


Individual and group instruction. It has sometimes been erroneously 
assumed that there is a necessary conflict between individual and group 
instruction. While all learning is individual learning, it can take place in a 
group setting, and certain types of learning can take place only in a group 
setting; for the individual not only learns in the group, he learns from the 
group as well. In other words, the important question is: What kind of 
group organization best provides for individual learning? The problem is 
to find somewhere between the two extremes of a complete tutorial sys- 
tem and an out-and-out lecture system the program which represents 
the best possible compromise between that which is educationally ideal 
and that which is administratively feasible. 

Homogeneous or ability groups. Shortly after the development of 
group intelligence tests in 1917, educational leaders began to use these tests 
for grouping pupils in school. This procedure was commonly referred to as 
“homogeneous grouping.” It soon became evident, however, that such 
groups were far from homogeneous, even in intelligence, not to mention 
other characteristics. The best result that can be obtained under ordinary 
school conditions is to reduce somewhat the heterogeneity of the instruc-
tional groups. The term “ability grouping” came into use as a more accu-
rate term, although frequently used interchangeably with “homogeneous


25 Thirty-Third Yearbook, op. cit., page 83. Quoted by permission of the Society.
26 Thirty-Third Yearbook, op. cit., page 164. Quoted by permission of the Society.




grouping." While much confusion still exists, many writers have recently 
attempted to make a distinction between these terms. Instructional groups 
which are made less heterogeneous in learning ability, usually by the em- 
ployment of general intelligence tests, are called "ability groups." Groups 
formed upon the basis of some common interest, social maturity, or other 
similar basis, are called “homogeneous groups." An activity in a progres- 
sive school, although made up of pupils of varying abilities, is certainly 
homogeneous from the standpoint of the objective sought. Most of the 
criticism of grouping is directed against groups formed on the basis of 
ability. Doubtless, nobody would desire a group possessing the maximum 
degree of heterogeneity, even in intellectual ability, and certainly not in 
chronological age, physical maturity, background, motivation, and the like. 
It is probable, therefore, that everybody wants a group with a certain degree 
of homogeneity. The differences arise regarding the degree and basis of the 
homogeneity.²⁷

Arguments for and against ability grouping. An imposing list of a 
dozen or more arguments for, and an equal number against, ability group-
ing has been assembled.²⁸ The crucial point at issue is: Do groups formed
upon the basis of ability aid or hinder learning? Among the alleged advan- 
tages, it is argued that ability grouping makes it easier to adapt instruc- 
tional materials and methods to the individual pupil, thereby stimulating 
bright pupils and encouraging dull pupils, with the result that achievement 
is increased and failure reduced. Among the alleged disadvantages, on the 
other hand, it is argued that the system is essentially undemocratic and 
that any gains in academic achievement are likely to be slight in amount 
and purchased at too dear a price, since the bright pupils tend to graduate 
too young and to develop a sense of superiority, while dull pupils may over- 
work or may develop a sense of inferiority. Here as always, however, it is 
impossible to decide a scientific question merely by counting the arguments 
pro and con, or by attempting to weigh the logic or fervor with which they 
are advanced. Fortunately, on this problem a considerable amount of ex- 
perimental work has been done, although most of the studies must be char- 
acterized as inadequate and inconclusive. 

The experimental evidence. Summaries of the experimental literature 
relating to ability grouping have been made by Billett,²⁹ Wyndham,³⁰


27 Cf. Henry J. Otto, Elementary School Organization and Administration (Second
Edition), page 184. New York: D. Appleton-Century Company, 1944.

28 For rather complete summaries of the arguments, see: Austin H. Turney, “The
Status of Ability Grouping,” Educational Administration and Supervision, 17:23, Janu-
ary, 1931; Ninth Yearbook of the Department of Superintendence, pages 121-126. Wash-
ington, D. C.: National Education Association, 1931; and Ernest W. Tiegs, Tests and
Measurements in the Improvement of Learning, pages 262-264. Boston: Houghton Mifflin
Company, 1939.

29 Roy O. Billett, op. cit., pages 16-37.

30 Harold S. Wyndham, Ability Grouping, pages 128-159. Melbourne, Australia: Mel-
bourne University Press, 1934.




Cornell,³¹ and various writers in the Review of Educational Research.³²
In 1934 a foreign observer³³ commented upon the “haphazard condition”
of the research upon the problem and pointed out that the experimental 
studies “raise more issues than they settle.” 

Ten years later an American educator³⁴ saw “little or no solid, objec-
tive evidence upon which to base a decision as to the effectiveness of
homogeneous grouping as actually practiced." Cornell states the situation 
as follows: “Reviewers are generally agreed that the experimental evidence 
as to the achievement status of pupils under a plan of ability grouping is 
inconclusive.”³⁵ This writer notes, however, that “one of the most consist-
ent results” has been the increased speed of progress possible by bright
learners “at every level from the first grade through college,” and that a
reduction in the amount of failure by the less capable learners has been
“rather consistently reported.”³⁶ Her final conclusion is as follows:³⁷


The results of ability grouping seem to depend less upon the fact of grouping 
itself than upon the philosophy behind the grouping, the accuracy with which 
grouping is made for the purposes intended, the differentiations in content, method, 
and speed, and the technique of the teacher, as well as upon more general environ- 
mental influences. Experimental studies have in general been too piece-meal to 
afford a true evaluation of results, but when attitudes, methods, and curricula are 
well adapted to further the adjustment of the school to the child, results, both 
objective and subjective, seem to be favorable to the grouping. 


The above statement is worthy of careful study. It seems reasonably 
clear that the evil effects of ability grouping feared by its opponents need 
not occur; and, on the other hand, that the alluring advantages claimed 
by its advocates may not materialize. In other words, there is no money- 
back guarantee with ability grouping. At best, it merely affords more fa- 
vorable conditions for doing something about the problem of individual 
differences. The fundamental adjustments must be in terms of properly 
differentiated curricula and of teaching methods. On this point Otto says: 

All authorities are agreed that no classification scheme can remove the need for 
adjusting instructional materials and methods to the varying needs of pupils in 
the group.³⁸


Upon certain important issues, unfortunately, there has been little or no 


31 Ethel L. Cornell, “Effects of Ability Grouping Determinable from Published
Studies,” Thirty-Fifth Yearbook of the National Society for the Study of Education, Part I,
pages 289-304. Bloomington, Illinois: Public School Publishing Company, 1936. Quoted
by permission of the Society.

32 At three-year intervals, beginning in Volume I, 1931.

33 Harold S. Wyndham, op. cit., page 156.

34 L. A. Williams, Secondary Schools for American Youth, page 290. New York:
American Book Company, 1944.

35 Ethel L. Cornell, op. cit., page 295. Quoted by permission of the Society.

36 Ibid., pages 296-297. Quoted by permission of the Society.

37 Ibid., page 304. Quoted by permission of the Society.

38 Henry J. Otto, op. cit., page 195.




experimentation. No one, for example, has determined the effect of various 
methods of adapting work to pupils of different levels of ability. This is 
especially important, since the methods actually employed have usually 
been most effective for dull learners and least effective for bright learners.
In most cases, probably, the methods have been those which are used with 
ordinary heterogeneous groups, and which appear to be least appropriate 
to the more capable individuals. 

Nor has there been any convincing experimental attack to determine the 
effect of ability grouping upon the work habits and mental health of the 
pupils. Such meager results as do exist are favorable. Maller³⁹ found evi-
dence that such desirable social traits as co-operation were developed better
under a system of ability grouping. It is a common observation that the 
best competition in sports, such as golf and tennis, is among those who 
“play about the same kind of game." Additional evidence that homogeneity 
is an attribute of natural social groups is afforded by the numerous studies 
which have shown that there is a positive correlation between friends of 
all ages, as well as between husbands and wives, on practically all person-
ality traits investigated.⁴⁰ Partridge⁴¹ points out that several studies have
revealed a greater similarity among friends in mental age than in chrono- 
logical age. But the main reliance so far has been upon questionnaire 
studies, of which the most extensive is by Sauvain.⁴² One study of the atti-
tude of 645 junior high-school pupils toward ability grouping came to the
conclusion that “the great majority are happy and satisfied . . . and that 
they accept and believe in the grouping that exists as the best situation 
for them.”⁴³ That the opinions of parents as well as of teachers were favor-
able to ability grouping in the cities where it was employed is indicated
by the following conclusions:⁴⁴


On the whole, where grouping is used, parents believe that children are at least 
as happy, do better work in school, and are correctly sectioned according to 
ability... . 

Teachers seem to like ability grouping somewhat more than do the parents. 

They believe that grouping improves social attitudes, leads to better work by 
pupils, and increases the happiness of children. . . . 


39 Julius Bernard Maller, Coöperation and Competition, page 163. New York: Bureau
of Publications, Teachers College, Columbia University, 1929.

40 Helen M. Richardson, “Studies of Mental Resemblance between Husbands and
Wives and between Friends,” Psychological Bulletin, 36: 104-120, February, 1939.

41 E. DeAlton Partridge, Social Psychology of Adolescence, Chapter V. New York:
Prentice-Hall, Inc., 1938.

42 Walter Howard Sauvain, A Study of the Opinions of Certain Professional and Non-
Professional Groups Regarding Homogeneous or Ability Grouping, 151 pages. New York:
Bureau of Publications, Teachers College, Columbia University, 1934.

43 Austin H. Turney and M. E. Hyde, “The Attitude of Junior High School Pupils
toward Ability Grouping,” School Review, 39: 600, October, 1931.

44 Walter Howard Sauvain, op. cit., pages 115, 116.




The technique of ability grouping. There is no general agreement 
as to the best basis for ability grouping. In fact, there is probably no one 
“best basis.” Much depends upon the local conditions, the data available, 
the nature of the subject, the size of the school, the fundamental philosophy 
of the school, and the like. It is often true, as one writer suggests, that “the 
soundest policy in dealing with educational measurements is to obtain 
objective data and interpret them subjectively.” * Nor is there uniformity 
in either theory or practice regarding the number and size of the groups, 
the proper differentiation in methods and curricula, or the relative empha- 
sis upon acceleration and enrichment for the bright groups. 

A useful distinction is made between vertical and horizontal classification.
Vertical classification attempts to bring together pupils of approximately
mately the same status. The successive grade levels of the ordinary school 
represent such an attempt. The basis is usually CA, or some combination 
of CA, MA, and EA. The use of the average of the MA and EA, or the 
average G-score on an intelligence test and a general achievement test, has 
much to commend it in the intermediate and upper classes. Horizontal
classification means that on any grade level the pupils are further divided 
according to ability, or rate of learning. For this ability grouping in the 
academic subjects, the IQ, or a combination of IQ and CA, is probably 
most often employed. Boyer shows how a two-way distribution of IQ and 
CA, divided by horizontal and vertical lines, may be used effectively for 
this purpose. In the high school, aptitude tests are sometimes better than
general intelligence tests. In other words, the purpose is to bring together 
for instructional purposes those pupils who represent approximately the 
same educational and mental status, and who are capable of progressing 
in the subject at about the same rate. The system should be flexible enough 
to permit the shifting of pupils from one group to another in any subject 
whenever it is evident they are improperly classified in that subject. For 
non-academic subjects, such as woodwork and music, for extracurricular 
activities, and possibly for the homeroom in high school, the groups may 
be as heterogeneous as the population of the school. Such a flexible program 
is inherently democratic. Small schools are of necessity limited to informal 
groupings made within the classroom.
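
The two kinds of classification described above can be illustrated with a short computational sketch. It is offered only as an illustration under stated assumptions: ages are expressed in months, the IQ is taken as the classical ratio 100 × MA / CA, and the cutting points on the IQ and CA scales are hypothetical values chosen for the example, not values recommended by Boyer or by any study cited here. The pupil records are invented.

```python
from dataclasses import dataclass


@dataclass
class Pupil:
    name: str
    ca: int  # chronological age in months
    ma: int  # mental age in months
    ea: int  # educational age in months


def ratio_iq(p: Pupil) -> float:
    """Classical ratio IQ: 100 * MA / CA."""
    return 100 * p.ma / p.ca


def vertical_index(p: Pupil) -> float:
    """Average of MA and EA, one suggested basis for grade placement."""
    return (p.ma + p.ea) / 2


def grid_cell(p: Pupil, iq_cuts=(90, 110), ca_cuts=(120, 132)) -> tuple:
    """Locate a pupil in a two-way IQ-by-CA table.

    The cutting points are hypothetical; a school would set its own.
    Returns (iq_band, ca_band), each counted from 0 upward.
    """
    iq_band = sum(ratio_iq(p) >= cut for cut in iq_cuts)
    ca_band = sum(p.ca >= cut for cut in ca_cuts)
    return iq_band, ca_band


# Invented records for illustration only.
pupils = [
    Pupil("A", ca=120, ma=132, ea=126),
    Pupil("B", ca=130, ma=117, ea=121),
]

for p in pupils:
    print(p.name, vertical_index(p), ratio_iq(p), grid_cell(p))
```

Pupils who fall in the same cell of the two-way table would form one tentative section. In keeping with the flexibility urged above, such a placement is a starting point to be revised whenever a pupil proves to be improperly classified in a given subject.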

But the most important problem of adjustment yet remains for the
classroom teacher. She must study the individual pupils in her class, whenever
necessary must divide them into temporary groups for remedial instruction,
and must vary the instructional materials and teaching methods
as conditions seem to warrant. In the last analysis, the adjustment of the 


Jacob S. Orleans, Measurement in Education, page 286. New York: Thomas Nelson
and Sons, 1937.

46 Cf. William A. McCall, Measurement, Chapter XI. New York: The Macmillan
Company, 1939.

Thirty-Fifth Yearbook, op. cit., pages 199-208.




school to individual differences becomes a teaching problem. As McCall 
says, “But after all, how pupils are taught and not how they are grouped 
is the vital matter.” Turney puts the matter concisely: 


The actual sectioning is but a minor part of ability grouping; the real job rests 
with the teachers. To adjust subject matter so that a child can use his mental
ability, and to adjust method so that he will use it—these are the outstanding
problems, for it is idle to talk of effective development unless children can and do
use their mental ability. 


Special classes are sometimes formed for pupils at the extremes of the 
distribution, although in high school those for the very slow learner are 
much more frequent than those for the very bright. In such classes the 
teaching is highly individualized. Patience, skill in diagnosing pupil diffi- 
culties, and training in mental hygiene are important qualifications for 
teachers of slow classes. High intelligence, versatility, sound scholarship, 
and a thorough grounding in psychology are essential qualifications for 
teachers of special classes for bright pupils. 

It is doubtless possible to overdo the idea of “special” classes and schools 
of one sort or another. Although in a real sense every pupil is unique and 
should receive special attention, it would certainly be a grave mistake to 
become so occupied with the "exceptional" pupils as to overlook adequate 
educational provision for the larger group, who are, to all intents and purposes,
“perfectly normal.” This situation has been satirized as follows:


Johnny Jones has lost a leg,
Fanny's deaf and dumb,
Marie has epileptic fits,
Tom’s eyes are on the bum.
Sadie stutters when she talks,
Mabel has T.B.,
Morris is a splendid case
Of imbecility.
Billy Brown's a truant,
And Harold is a thief,
Teddy’s parents gave him dope
And so he came to grief.
Gwendoline’s a millionaire,
Gerald is a fool:
So every one of these darned kids
Goes to a special school.
They’ve specially nice teachers,
And special things to wear,
And special time to play in,
And a special kind of air.
They’ve special lunches right in school.

William A. McCall, op. cit., page 168.

Thirty-Fifth Yearbook, op. cit., pages 113-115. Quoted by permission of the Society.

Roy O. Billett, op. cit., page 196.

Elmer Harrison Wilds, The Foundations of Modern Education, page 523. New York:
Farrar & Rinehart, Inc., 1942.




While I—it makes me wild!— 
I haven't any specialties, 
I'm just a normal child.


Acceleration and retardation. In the elementary school a common 
device for reducing the heterogeneity of the class is to eliminate the ex- 
tremes of the distribution at promotion time. To do this a small number of 
the most capable pupils are allowed to “skip” a grade or half grade, and 
usually a larger number of the least capable pupils are “failed,” or “retained”
in the same grade for another year or half year. Witty and Wilkins
published a critical survey of the literature relating to acceleration, and, 
in spite of certain limitations in the studies, concluded that “most reports 
show clearly that acceleration, when practiced, is associated with desirable 
adjustment in all types of development for which data have been as- 
sembled." One experiment in which pupils allowed to skip a grade were 
paired with pupils of like ability not skipped, led to the conclusion that 
“under reasonably favorable conditions skipping is a satisfactory method 
of accelerating pupils of superior ability.” 

Recent studies have attempted to determine the effect of acceleration 
upon the pupils’ personality and social adjustments in high school and 
college, apparently accepting Terman’s verdict regarding the academic 
achievement of superior pupils: “The earlier they enter college the better 
work they do there, at least down to an entrance age of 15 years.” * 
Almost without exception the results appear to be favorable. Engle, for 
example, found that accelerated students in high school when compared 
with other students of their own chronological age were “at least as active 
socially as non-accelerated students." 5* In 1943, Pressey * surveyed the lit- 
erature regarding acceleration on the college level and came to the conclu- 
sion that “the great majority of accelerated students do well in school, are 
socially adjusted, do not suffer in health, and are not handicapped in after- 
school career.” 

During World War II great emphasis was placed upon accelerated pro- 
grams of education, particularly on the college level. Several colleges have 
attempted to investigate the effect of these programs upon the students. 


Paul A. Witty and Laroy W. Wilkins, “The Status of Acceleration or Grade Skipping
as an Administrative Practice,” Educational Administration and Supervision, 19:
321-346, May, 1933.

Jesse E. Adank and C. C. Ross, “Is Skipping Grades a Satisfactory Method of
Acceleration?” American School Board Journal, 85: 24-25, July, 1932.

Lewis M. Terman, “The Gifted Student and His Academic Environment,” School
and Society, [...] 21, 1939.

Thelburn L. Engle, “A Study of the Effects of School Acceleration upon the Personality
and Social Adjustments of High-School and University Students,” Journal of
Educational Psychology, 29: 523-529, October, 1938.

Sidney L. Pressey, “[...] versus Lock Step,” Educational Research Bulletin, 22:
29-35, February 17, 1943.




Studies directed by Pressey at Ohio State University have been especially 
noteworthy. With few exceptions the results have favored acceleration. 
Another investigation concludes that many more superior women students
than usually attempt it “can complete a college program in three
years or less without unfortunate effects as regards scholarship, recreation, 
health, or after-school career." 

In a comprehensive, heavily documented monograph Pressey summa- 
rizes the literature on acceleration and calls upon educators to break the 
lock-step curriculum at all levels for the abler students: 


There is evidence to suggest that superior children might best begin school some- 
what younger than average children—that school entrance should be on the basis 
of total maturity rather than simply on arrival at the chronological age of six. In 
elementary and secondary schools, even though accelerates have usually not been 
selected carefully on any broad basis or helped to adjust to the more advanced 
work and new companions, a majority of them have nevertheless continued to show 
academic superiority and good social adjustment after acceleration. When accelerates
have been carefully chosen as intellectually superior and well adjusted, means 
provided for facilitating advancement, as by adjustment classes or sections arranged
for superior students who then move forward together to cover perhaps
three years of junior-high school work in two, results have been found almost 
distressingly satisfactory. Such students do well in later public-school work (as in 
senior high school) and in their relationships with pupils in the classes into which 
they have been advanced, sometimes even better than students of equal ability 
and academic record at the time the special program of acceleration began who
continued at the regular pace. The conclusion seems unavoidable that the usual lock 
step grossly wastes the time of the ablest young persons. In college, students who 
entered early as a result of acceleration similarly show the highest academic record, 
the lowest academic mortality, and high participation in campus life.


The weight, both of the arguments and of experimental evidence, appears
to be against failure or retardation as a school policy. In Otto’s
survey the literature indicated that about 20 per cent of repeaters do better 
and 40 per cent do worse than before. He concluded that if the objective 
of the modern school is the optimum development of its pupils, “nonpromotion
is not the way to get it.” Several studies indicate the value of
trial promotions. An investigation by McKinney,* for example, involving 
more than 13,000 pupils, shows a saving of about three out of every four 


Cf. S. L. Pressey and S. B. Folk, “First Evaluations of an Accelerated Program
in a College of Engineering,” Journal of Engineering Education, 34: 477-485, March,
1944; S. L. Pressey, “Acceleration the Hard Way,” Journal of Educational Research,
37: 561-570, April, 1944.

Marie A. Flesher, “An Intensive Study of Seventy-Six Women Who Obtained Their
Undergraduate Degrees in Three Years or Less," Journal of Educational Research, 39: 
602-612, April, 1946. 

9 Sidney L. Pressey, Educational Acceleration; Appraisals and Basic Problems (Bureau
of Educational Research Monograph No. 31), page 143. Columbus, Ohio: Ohio
State University, 1949. Italics added.

Henry J. Otto, op. cit., page 232.

H. T. McKinney, Promotion of Pupils, A Problem in Educational Administration,
206 pages. Urbana, Illinois: University of Illinois, 1921.




repeaters. One study has shown that the threat of failure affords ineffective 
motivation.” Certainly, with a modern curriculum and an adequate pro- 
gram of diagnosis and guidance, few if any failures should occur. 

Continuous promotion. Otto* has proposed a somewhat theoretical 
but very suggestive promotion plan for the elementary school, which abol- 
ishes not only acceleration and nonpromotion but the term “school grade" 
as well. Such a type of organization has been in successful operation in 
several school systems for a number of years. 

His plan involves the following five essential features: 


1. There would be available extensive data of an objective character on 
each child, so that he may be "placed at all times in groups in which he can 
work to the best advantage in terms of his own developmental readiness." 

2. There would be continuous pupil adjustment and progress, with shifts 
from one group to another “at any time during the year that a change 
would seem advisable." 

3. The major classifications which take place in the ordinary school at 
the beginning of each term would be eliminated. 

4. It would make possible longer teacher-group relationships in which 
“the same teacher works with the same group of children for two or three 
consecutive semesters or years.” 

5. The conventional competitive marking system would be replaced with 
“extensive, objective, cumulative data on many aspects of the growth and 
development of each child.”


SELECTED REFERENCES FOR FURTHER READING 


Berkshire, James R., “Improvement in Grading Practices for Air Training Com- 
mand Schools,” ATRC Manual 50-900-9, Headquarters, Air Training Command, 
Scott Air Force Base, Illinois, June, 1952. 39 pages.

Cook, Walter W., “The Functions of Measurement in the Facilitation of Learning,”
Chapter 1 in E. F. Lindquist (Editor), Educational Measurement. Washington, 
D. C.: American Council on Education, 1951. 

Crow, Lester D., and Crow, Alice, Educational Psychology. New York: American 
Book Company, 1948. Chapters 11 and 27, “Educational Implications of Individual
Differences” and “Adjustment of Exceptional Individuals.”

Educational Policies Commission, Education of the Gifted. Washington, D. C.:
National Education Association, 1950. 88 pages. 

Ellis, Robert S., Educational Psychology, a Problem Approach. New York: D. Van 
Nostrand Company, 1951. Chapters II, XII, and XV: “The Admission and
Classification of Pupils, Their Achievement, and Their Elimination from School”;
“Classroom Tests and Marks"; and “Exceptional Children.” 

Gilbert, Arthur W., “Design and Pattern of the Curriculum,” Review of Educational
Research, 21: 196-208, June, 1951.


Henry J. Otto and Ernest O. Melby, “An Attempt to Evaluate the Threat of Failure
as a Factor in Achievement,” Elementary School Journal, 35: 588-596, April, 1935.

5 Henry J. Otto, op. cit., 236-242.




Hildreth, Gertrude H., Educating Gifted Children at Hunter College Elementary 
School. New York: Harper & Brothers, 1952. 272 pages. 

Hollingshead, Augustus B., Elmtown’s Youth; the Impact of Social Classes on 
Adolescents. New York: John Wiley & Sons, 1949. Chapters 8 and 13, “The High 
School in Action” and "Leaving School.” 

Mackenzie, Gordon N., and Bebell, Clifford, “Curriculum Development," Review 
of Educational Research, 21: 227-237, June, 1951. 

Pressey, Sidney L., Educational Acceleration (Bureau of Educational Research 
Monograph No. 31). Columbus, Ohio: Ohio State University, 1949. 153 pages. 

Terman, Lewis M., and Oden, Melita H., The Gifted Child Grows Up. Stanford, 
California: Stanford University Press, 1947. 448 pages. 

Witty, Paul (Editor), The Gifted Child. Boston: D. C. Heath and Company, 1951. 
338 pages. 


14


Evaluation in Guidance 


The field of guidance developed rapidly from 1918 onward, concurrently 
with the beginnings of mass group testing. Now a vast body of information 
and techniques must be mastered before one can qualify as a competent 
professional guidance worker, so it is not feasible in an elementary measurement
textbook to do more than make a few brief remarks concerning evaluation
in school guidance programs. The interested student will want to
scan the selected references at the end of this chapter for supplementary
reading material. 


A. The Meaning and Importance of Guidance 


The fundamental problem of life is adjustment. At birth the human 
infant is much less well adjusted to the world in which he must live than 
many of the simpler organisms. Man’s dominant place in the universe is 
due largely to his remarkable capacity for modifying his reactions in the 
direction of a more adequate adaptation to the conditions under which he 
must live. The process by which these changes take place is called learning, 
and the result is called education. The function of the school is to provide 
a favorable environment in which these changes may take place. The role 
of the classroom teachers and of the school administrators is to stimulate 
and to direct the learning process.

The aim of all guidance is to assist the learner to acquire sufficient under- 
standing of himself and of his environment to be able to utilize most intel- 
ligently the educational opportunities afforded by the school and the 
community. The problem of guidance arises from the fact that an immature
but growing individual with a unique combination of abilities and limitations
is confronted with a complex and ever-changing environment. Guidance
used to be regarded as an effort “to see through Johnny and to see
Johnny through." The emphasis today has shifted to an effort “to help 
Johnny see through himself and to see himself through.” 1 It seeks to assist 
each student to choose, and make satisfactory progress in, those activities 
which will contribute most to his development, individual happiness, and 
social worth. 

Certain circumstances have conspired to make guidance one of the most 
important responsibilities of the modern school. This is particularly true
of the secondary school and of the college. Figure 12 on page 248 shows 
that in 1900, 4 per cent of all students in the United States were enrolled 
in secondary schools and 1.4 per cent in institutions of higher learning, 
while in 1953 the corresponding percentages were 18.1 and 6.2. During this 
same period the total number of students doubled, rising from 17,200,000 
to 34,400,000. As a result of these conditions, the student body of the 
modern secondary school and college represents a greater diversity of back- 
grounds, interests, ambitions, and abilities than has ever been true before. 
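
The percentages just cited can be converted into approximate head counts, which makes the growth plainer. The following sketch simply multiplies the totals by the percentages given in the text; the resulting counts are derived figures, not figures reported in Figure 12, and the variable and function names are illustrative only.

```python
# Totals and percentages as stated in the text.
total_students = {1900: 17_200_000, 1953: 34_400_000}
percent_enrolled = {
    1900: {"secondary": 4.0, "higher": 1.4},
    1953: {"secondary": 18.1, "higher": 6.2},
}


def head_count(year: int, level: str) -> int:
    """Approximate enrollment implied by the stated total and percentage."""
    return round(total_students[year] * percent_enrolled[year][level] / 100)


def growth_ratio(level: str) -> float:
    """How many times larger the 1953 count is than the 1900 count."""
    return head_count(1953, level) / head_count(1900, level)


for level in ("secondary", "higher"):
    print(level, head_count(1900, level), head_count(1953, level),
          f"{growth_ratio(level):.2f}")
```

On these figures, secondary enrollment rose from about 688,000 to about 6,226,000, roughly a ninefold increase, while the total number of students only doubled. The enrollment in higher institutions grew in nearly the same proportion.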

At the same time science and invention have greatly complicated and 
are constantly changing the social and economic world from which these 
pupils come and to which they must return. Likewise, the school situation 
itself, academically as well as socially, has markedly increased in complex- 
ity. The small high school with a single curriculum leading to college has 
tended to give way to larger schools with a more diversified program. 
Judd called attention to the fact that the number of subjects offered in
American high schools increased from 9 in 1890 to more than 250 in 1942.
At the present time a pupil of high-school age in a large modern American 
city has a choice of a score or more different curricula. 

As it is always easier for the traveler to lose his way in a large city than 
in a small town, especially if he is lacking in maturity and experience, it is 
perhaps not surprising that many of those who enter the modern secondary 
school and college never succeed in making a satisfactory adjustment to 
these institutions. Likewise, the vast number of adolescents and adults 
who find their way into penal institutions or into hospitals for the physi- 
cally and the mentally ill, and the much larger number of others who lead 
unhappy and unsuccessful lives afford tragic evidence that adjustment 
outside the school has been equally unsatisfactory. There seems no escaping 
the fact that when the conditions of life increase in complexity, the need 
for guidance increases proportionately. The better the guidance program 
the less will be the need for diagnostic and remedial work later on. An ade- 
quate guidance program is the best form of prevention. 


1 George E. Myers, Principles and Techniques of Vocational Guidance, page 4. New 
York: McGraw-Hill Book Company, 1941.

Charles H. Judd, “General Education and the Baccalaureate Degree,” School and
Society, 56: 35, July 11, 1942.


B. The Place of Measurement in Guidance 


Two errors are common in assigning the place of measurement in guid- 
ance. The first of these, fortunately less common now than in the early days 
of standard testing, is to think of guidance as synonymous with testing. 
Guidance is always more than the giving of tests, no matter how extensively or
carefully done. As a matter of fact, whether or not tests serve any guidance 
function depends upon the use made of the results. Here, as elsewhere, 
tests are merely tools. The second error, very common today, is to dismiss 
measurement altogether and to regard it as wholly unessential to guidance 
if not indeed an actual obstacle. This viewpoint is as extreme as the first, 
however. While testing is never everything in guidance, it is almost always 
something. In fact, it may be asserted confidently that evaluation in some 
form is implicit in the guidance function. Properly used tests are valuable 
aids to self-analysis. 

Fowler* indicates that many common mistakes will be avoided if users 
of tests in guidance will remember at all times that “the only justifiable 
reason for using tests in the guidance program is to serve the individual
inventory in counseling.” He formulates seven “guiding rules” based upon
this point of view: 


1. Any item of the individual inventory, whether it be a test score, a teacher’s 
mark, a fact about the pupil’s health, can be interpreted in the counseling situa- 
tion only in the light of all the other inventory data having some bearing on the 
problem at hand. This is to say, a chief value of test scores is the check which 
they provide upon the meaning of other accumulated facts. In turn, the importance 
to be accorded test scores in any given case must be weighed in the light of other 
data from the individual inventory. Dependence must be placed upon tests to 
supply facts when they have not been accumulated through other means.

2. Test scores, like other items in the inventory, must be interpreted cautiously
until norms are scientifically established for the local situation and for the particular
kind of problem which the pupil presents.

3. The meaning of a test score may not be the same from one pupil to another
because of the differences in other pertinent inventory data. The meaning may
change even for the same pupil from one problem to another or from one time to
another.

4. Real counseling will encourage decisions or judgments only on the basis of
as full an inventory of pertinent facts as possible. Thus several measures are usually 
better than just one or two. Likewise, the same dependence will not be placed 
upon so-called “interest” or “personality” tests as upon achievement and aptitude 
tests.

5. It is recognized that certain tests are regularly used in the school by the
administrator in pupil classification and curriculum planning. They are used by
teachers in individualizing teaching methods. The data from these same tests are
of even greater use for counseling and should always be recorded in the cumulative
record. Tests used by the administrator for these purposes may supplement the


Fred M. Fowler, “To Inquirers about Tests,” Education for Victory, 3: 12-13, December
4, 1944.




tests used only by the counselor. This fact should not be overlooked in their 
choosing. 

6. Tests are best used as aids to counseling, rather than as standards for arbi- 
trary selection (or rejection) for training and job opportunities. 

7. Familiarity with a test, gained through its use, is important. In deciding to use 
a new test to measure the same traits, loss of this familiarity should be weighed 
carefully against the possible gain in reliability, validity, usability, and economy.


If these suggestions are kept in mind, the dangers against which Rogers
warns will be avoided. 

Counseling is the process of assisting the individual in making the maxi- 
mum adjustment to the educational opportunities of his environment in 
terms of his abilities, interests, and needs. It is normally a face-to-face 
relation between an older, more experienced person and a less mature 
person. It is an example of co-operative problem solving. The role of the 
counselor is not to make decisions for the pupil, but rather to help him 
to solve his own problems intelligently. The pupil's part naturally increases
with his maturity, the ultimate purpose of counseling being so to develop 
the pupil's self-reliance that outside help becomes progressively unnecessary.

Counseling is not a new development in education. In some type or other 
it has probably existed longer than formal education itself. The difficulty 
has been not with the amount of counseling available but with its quality. 
Nowhere is the wisdom of the homely American philosopher, Josh Billings,
more apparent than in counseling: “It is better to kno less, than to kno so 
mutch that ain't so." The improvement of measurement techniques in the 
present century and the development of cumulative record forms have 
made it possible to substitute factual data for opinion and hearsay. 


C. Guidance Is a Co-operative Venture


In a very real way all persons in contact with the child—teachers, 
administrators, specialized educational personnel, clinic staffs, parents, 
ministers, law enforcement officials, and others in the community—are 
guidance workers. They must work together effectively in order to pre- 
vent juvenile delinquency, inadequate benefit from schooling, vocational 
frustration, neurotic inadequacy, and mental illness. This calls for continu- 
ous, co-ordinated effort by everyone. No one person or agency can shoulder 
the burden alone, for no single group has the necessary training, experience, 
and facilities. Legal, medical, social work, economic, and political consid- 
erations often loom fully as important as purely school-based educational 
aspects. The teacher should remember eternally that his is a co-operative 
task where guidance is concerned. Though it is undoubtedly true in a cer- 
tain sense that “teaching is guidance,” much more guidance than can be 


5 Carl R. Rogers, “Psychometric Tests and Client-Centered Counseling,” Educational
and Psychological Measurement, 6: 139-144, Spring, 1946.




given in the classroom under present conditions is essential if America's 
children are to develop optimally. 


SELECTED REFERENCES FOR FURTHER READING


Arbuckle, Dugald S., Teacher Counseling. Cambridge, Massachusetts: Addison-
Wesley Press, 1950. 179 pages. 

Bennett, George K., and Seashore, Harold, “Testing for Vocational Guidance in 
High School,” pages 71-79 in Henry Chauncey (Chairman), 1947 Invitational 
Conference on Testing Problems, “Exploring Individual Differences.” American 
Council on Education Studies, Series I, No. 32, Vol. XII, October, 1948. 

Bennett, George K., Seashore, Harold G., and Wesman, Alexander G., A Manual 
for the Differential Aptitude Tests (Second Edition). New York: Psychological 
Corporation, 1952. 77 pages. 

Bledsoe, Joseph C., “Success of Non-High School Graduate GED Students in 
Three Southern Colleges,” College and University, 29: 381-388, April, 1953. 

Coleman, William, and Cobb, E. B., Guidance Use of Test Results. Professional
Manual No. 2, Tennessee State Testing Program. Knoxville, Tennessee: University
of Tennessee, November, 1951. 47 pages.

Crawford, Albert Beecher, and Burnham, Paul Sylvester, Forecasting College 
Achievement: A Survey of Aptitude Tests for Higher Education, Part I. New Haven: 
Yale University Press, 1946. 291 pages. 

Darley, John G., and Anderson, Gordon V., “The Functions of Measurement in 
Counseling," Chapter 3 in E. F. Lindquist (Editor), Educational Measurement. 
Washington, D. C.: American Council on Education, 1951. 

Davis, Frederick B., Utilizing Human Talent. Washington, D. C.: American Council
on Education, 1947. 85 pages.

Davis, Frederick B. (Editor), “Educational and Psychological Testing," Review of 
Educational Research, 23: 1-110, February, 1953. See reviews of the 1949-52 
literature concerning tests of general mental ability (Julian C. Stanley), tests of 
special aptitude (William G. Mollenkopf), nonprojective tests of personality and 
interest (David V. Tiedeman and Kenneth W. Wilson), and projective tests of 
personality (John W. M. Rothney and Robert A. Heimann). 

Erickson, Clifford E., The Counseling Interview. New York: Prentice-Hall, Inc.,
1950. 174 pages.

Frederiksen, Norman, and Schrader, W. B., “The ACE Psychological Examination
and High School Standing as Predictors of College Success,” Journal of Applied
Psychology, 36: 261-265, August, 1952.

Froehlich, Clifford P., Guidance Services in Smaller Schools. New York: McGraw-Hill
Book Company, 1950. 352 pages.

Froehlich, Clifford P., and Darley, John G., Studying Students; Guidance Methods
for Individual Analysis. Chicago: Science Research Associates, 1952. 411 pages. 

Lefever, D. Welty, Turrell, Archie M., and Weitzel, Henry I., Principles and Techniques
of Guidance. New York: Ronald Press, 1950. 577 pages.

Rogers, Carl R., Client-Centered Therapy; Its Current Practice, Implications, and
Theory. Boston: Houghton Mifflin Company, 1951. 560 pages.

Rothney, John W. M., and Roens, Bert A., Guidance of American Youth; an Experimental
Study. Cambridge, Massachusetts: Harvard University Press, 1950.
269 pages.




Rothney, John W. M., and Danielson, Paul J., “Counseling,” Review of Educational 
Research, 21: 132-139, April, 1951. 

Rothney, John W. M., “Interpreting Test Scores to Counselees," Occupations, 30: 
320-322, February, 1952. 

Strang, Ruth, “Major Limitations in Current Evaluation Studies,” Educational
and Psychological Measurement, 10: 531-536, Autumn (Part 2), 1950.

Strang, Ruth, “7 Ways to Improve the Rating Process,” Occupations, 29: 107-110, 
November, 1950. 

Super, Donald E., Appraising Vocational Fitness by Means of Psychological Tests.
New York: Harper & Brothers, 1949. 727 pages.

Traxler, Arthur E., and Townsend, Agatha (Editors), Improving Transition from 
School to College. New York: Harper and Brothers, 1953. 165 pages. 

Willey, Roy D., Guidance in Elementary Education. New York: Harper and Broth- 
ers, 1951. 825 pages. 

Williamson, E. G., and Foley, J. D., Counseling and Discipline. New York: 
McGraw-Hill Book Company, 1949. 385 pages.

Wolfle, Dael, and Oxtoby, Toby, “Distributions of Ability of Students Specializing 
in Different Fields,” Science, 116: 311-314, September 26, 1952.


15 


Evaluation of Schools 


A. The Problem of Evaluation 


The motivational effects of evaluation programs have long been evident in the 
schools. If achievement is rated by tests, both teachers and pupils work to pass the 
tests. If progress is appraised in other ways, activities related to these methods of 
evaluation are evident in the daily or weekly school program. The modern point 
of view is that evaluation is not a series of periodic examinations applied externally 
but an intrinsic part of the learning process with its planning, evaluating cycle. 
Tests do, of course, have their places in this process. Viewed thus, methods of 
evaluation can be one of the most valuable tools for creating interest and purpose 
in further learning.¹


Measurement and evaluation. As used in education, evaluation is a 
far more inclusive concept than measurement. Two aspects of evaluation 
may be distinguished: (1) data relating to some important aspect of the 
school, such as its organization, program, or results; and (2) a set of values 
or standards against which these data are interpreted and appraised. Fur- 
thermore, the evaluator's educational philosophy and sense of values will 
determine what objectives of the school program he considers to be impor-
tant, as well as what data he will look for, or regard as relevant in the
situation. It is apparent that while measurement may be highly mechanical 
and at times a routine, evaluation can never be; at every stage evaluation 
requires the exercise of mature judgment. 

Measurement implies the use of some tool or instrument, such as a test
or scale, and provides a quantitative description of observed phenomena. 

¹ David H. Russell, “Motivation in School Learning,” page 64 in Forty-Ninth
Yearbook of the National Society for the Study of Education, Part I
(Learning and Instruction). Chicago: University of Chicago Press, 1950. Quoted by
permission of the Society.




This is always desirable, but it should never exclude relevant data of a 
subjective and qualitative character, or the consideration of outcomes not 
immediately observable. Some writers have criticized existing measure-
ment in education for the reason that it furnished inadequate data for
evaluation.² At best, measurement merely provides data needed for eval-
uation; it is not evaluation per se.

The Sixteenth Yearbook³ of the Department of Elementary School Prin-
cipals of the National Education Association presents a good discussion
of evaluation on the elementary level. The ten chapters of this report are
given below:


I. The Fundamentals of School Appraisal 
II. Appraising the School Organization 
III. Appraising Administrative and Supervisory Procedures 
IV. Evaluating the Curriculum 
V. Appraising Methods of Learning and Teaching 
VI. Evaluating Socializing Experiences 
VII. Measuring the Progress of Pupils 
VIII. Estimating the Efficiency of Teachers 
IX. Judging School Equipment 
X. A Review of the Technics of Appraisal


Note that the term “measuring” occurs in only one chapter heading. The
other terms, “appraising,” “evaluating,” “estimating,” and “judging,” all 
similar in meaning, imply the use of techniques that go beyond testing and 
examining. 

The Eight-Year Study of the Progressive Education Association,⁴ which
included both elementary and secondary education, and the Three-Year
Study of the Commission on Teacher Education⁵ on the college level are
illustrations of an enlarged conception of evaluation. The committee sought 
to devise suitable instruments of measurement for outcomes—such as in- 
terests, attitudes, creativeness, and various aspects of thinking—less tan- 
gible than those measured by ordinary tests and examinations. It also 
utilized other types of data, such as anecdotal records, family histories, 
records of the pupil’s activities, and the like. 

Possibly no more ambitious example of this enlarged conception of eval-
uation is available than the Cooperative Study of Secondary School Stand-
ards.⁶ Table 37 shows the six main methods the study employed, of which
only one represents the use of measurement in the ordinary sense.

² Cf. Verner M. Sims, “Educational Measurement and Evaluation,” Journal of Edu-
cational Research, 38: 18-83, September, 1944.

³ “Appraising the Elementary-School Program,” The National Elementary Principal,
16: 227-655, July, 1937.

⁴ Eugene R. Smith, Ralph W. Tyler, and Evaluation Staff, Appraising and Recording
Student Progress, 550 pages. New York: Harper & Brothers, 1942.

⁵ Maurice E. Troyer and C. Robert Pace, Evaluation in Teacher Education, 368 pages.
Washington: American Council on Education, 1944.


TABLE 37

THE MAIN METHODS OF EVALUATION USED BY THE COOPERATIVE STUDY OF
SECONDARY SCHOOL STANDARDS, WITH THE WEIGHT ASSIGNED TO EACH

Method                                              Per Cent
1. Evaluative Criteria                                  40
   A. Educational Program                               20
      Curriculum
      Pupil activity
      Library
      Guidance
      Instruction
      Outcomes
   B. Organization and Plant                            20
      Staff
      Administration
      Plant
2. General Judgments by Visiting Committees             20
3. Growth as Measured by Standard Tests                 20
4. Success of Pupils                                    10
   A. In College
   B. Noncollege
5. Judgment by Pupils                                    6
6. Judgment by Parents                                   4
   Total                                               100
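
The weighting in Table 37 amounts to a weighted average of six separate judgments. The short Python sketch below is offered only as an illustration of that arithmetic. The 0-to-100 rating scale, the function name composite_score, and the sample ratings are assumptions made here and are not part of the Cooperative Study's procedure; the weight of 20 for standard tests is the value that brings the six weights to the stated total of 100.

```python
# Weights (per cent) for the six methods listed in Table 37.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "evaluative_criteria": 40,
    "visiting_committee_judgment": 20,
    "growth_on_standard_tests": 20,   # value that makes the total 100
    "success_of_pupils": 10,
    "judgment_by_pupils": 6,
    "judgment_by_parents": 4,
}


def composite_score(ratings):
    """Return the weighted mean of the six ratings.

    `ratings` maps each method name to a rating on an assumed 0-100 scale.
    """
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the six methods")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[m] * ratings[m] for m in WEIGHTS) / total_weight


# Hypothetical ratings for one school, for illustration only.
example_ratings = {
    "evaluative_criteria": 70,
    "visiting_committee_judgment": 60,
    "growth_on_standard_tests": 50,
    "success_of_pupils": 80,
    "judgment_by_pupils": 90,
    "judgment_by_parents": 100,
}

if __name__ == "__main__":
    print(composite_score(example_ratings))
```

Because the tests carry only one fifth of the total weight in this scheme, even a large change in the test rating moves the composite comparatively little.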


The importance of evaluation. Without some form of evaluation 
everything about education becomes a matter of blindly hoping that all
is well. In the critical period shortly before the Civil War, Abraham Lincoln 
began an important address with this statement: “If we could first know 
where we are and whither we are tending, we could better judge what to do 
and how to do it." It is no less true in education than in government that 
we must first “know where we are,” and especially “whither we are tend- 
ing,” before we are in a position to judge intelligently regarding “what to 
do and how to do it.” In the final analysis, it is the function of all attempts 
at evaluation to afford a basis for rational action. Apparently, educators 
have always recognized this, even if often somewhat vaguely. For example,
a college president said: “Self-criticism and self-appraisal (now a ‘self sur-
vey’ or an ‘evaluation’) are as old as education.”⁷

⁶ For a report of this study, see Evaluation of Secondary Schools, General Report,
526 pages. Washington, D. C.: Cooperative Study of Secondary School Standards, 1939.
For a briefer statement, see How to Evaluate a Secondary School (1940 Edition), 139
pages. Washington, D. C.: Cooperative Study of Secondary School Standards.

The emphasis today is more and more upon the importance of self- 
evaluation. This holds true for all levels of education from the activity of 
the pupils in an elementary class passing judgment upon the success of a 
unit of instruction planned and executed by themselves to the formal 
Report of the Harvard Committee.⁸

Many years ago Thorndike pointed out that “the actual changes wrought 
in boys and girls by this or that form of education are being measured, 
old and new methods are being tested by experiment in the same spirit of 
zeal and care for the truth that animates the man of science, and the edu- 
cational customs which have been accepted unthinkingly by ‘use and wont’ 
are being required to justify themselves to reason.”⁹ Although it is probably
true that more progress has been made in that direction since the statement 
above was written than in all the centuries preceding, improvements in 
evaluation procedures have hardly kept up with those in curricula and 
teaching methods.¹⁰

The difficulty of evaluation. The problem of evaluating education is 
immensely complicated. Many approaches toward a solution have been 
made, and none has been entirely satisfactory. For many years the various
regional associations attempted to evaluate the secondary schools and col- 
leges of America indirectly by their possessions rather than directly by 
their products. Such measures as the size and qualifications of the staff and 
the number of books in the library were at best indications of educational 
opportunity; and even to a less extent were such things as the number or 
type of buildings in the school plant and the amount of financial support 
available. The limitations of such a procedure have been characterized as 
follows: The standards used were mechanical, rather than vital; rigid, 
rather than flexible; deadening, rather than stimulating; traditional, rather 
than progressive; academic, rather than liberal; broadly comprehensive 
and subjective, rather than scientific.¹¹ Intensive and extensive study of the
problem by committees within the past decade has increasingly revealed 
its complexity. The Cooperative Study of Secondary School Standards, for 
example, extended over a period of six years and cost about a quarter of a 
million dollars. It employed the six major methods of evaluation given in
Table 37, and developed three scales, whose composition is given in Table
38. It will be noted that the complete scale, Alpha, includes 110 different
“thermometers,” all relating to the nine evaluative criteria of the first
method listed in Table 37. In 1942 Eurich, Pace, and Ziegfeld surveyed
the literature of the field and came to this conclusion: “No simple and in-
expensive technique has as yet been devised nor is one likely to be devised
that will provide an evaluation of an entire educational program.”¹²


TABLE 38

COMPOSITION OF 1940 EDITION OF THE ALPHA, BETA, AND GAMMA SCALES FOR
EVALUATING SECONDARY SCHOOLS

                                          Number of Thermometers
Area                                      Alpha    Beta    Gamma
Curriculum and course of study              …        …       …
Pupil activity program                      …        …       …
Library service                             …        …       …
Guidance service                            …        …       …
Instruction                                 …        …       …
Outcomes                                    …        …       …
School staff                                …        …       …
School plant                                …        …       …
School administration                       …        …       …


⁷ Henry M. Wriston, “A Critical Appraisal of Experiments in General Education,”
Thirty-Eighth Yearbook of the National Society for the Study of Education, Part II, page
303. Bloomington, Illinois: Public School Publishing Company, 1939. Quoted by per-
mission of the Society.

⁸ General Education in a Free Society, 267 pages. Cambridge, Massachusetts: Har-
vard University Press, 1945.

⁹ Edward L. Thorndike, Education, A First Book, pages 7-8. New York: The Mac-
millan Company, 1912.

¹⁰ Cf. Pedro D. Orata, “Evaluating Evaluation,” Journal of Educational Research,
33: 641-661, May, 1940.

¹¹ Evaluation of Secondary Schools, General Report, op. cit., pages 53-55.


Evaluating teaching efficiency. A single illustration will show the 
complexity of the problem of evaluation. How can one best judge the worth 
of any particular classroom teacher? This is manifestly an important ques- 
tion. To a large extent the selection, growth, and promotion of teachers 
depend upon the answer. In general, the methods used are of three types. 
In the first place, tests and rating scales have been devised for measuring 
the personality of the teacher. As a rule, these have proved disappointing. 
The difficulty with this approach has been clearly pointed out by McCall: 
“No one has demonstrated just what causal relationship, if any, exists 
between possession of these various attributes and desirable changes in 
pupils.”¹³

A second method attempts to measure the worth of the teacher by her
performance, usually her activity before the class. For this purpose various 
score cards and rating scales have been devised. In fact, the typical rating 
scale attempts to secure various measures of the teacher's performance, 
together with measures of certain traits of personality deemed important 


¹² Alvin C. Eurich, C. Robert Pace, and Edwin Ziegfeld, “Evaluative Studies,” Review
of Educational Research, 12: 521-538, December, 1942.

¹³ William A. McCall, Measurement, page 403. New York: The Macmillan Company,
1939.




in teaching. But except as instruments of self-analysis by the teachers 
themselves, the practical value of rating scales is slight. For example, when 
the gains on the Stanford Achievement Test from November to May by 
pupils in four Wisconsin schools were used as a criterion, the correlations 
with 17 of the best-known measures of teaching ability available, although 
somewhat inconsistent, with few exceptions were so low that they could 
reasonably be supposed to have arisen from a population in which the true 
relationships were zero.¹⁴
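
The statement that correlations were low enough to have arisen from a population in which the true relationship is zero can be illustrated with the usual t ratio for a correlation coefficient, t = r√(n − 2) / √(1 − r²), referred to a t table with n − 2 degrees of freedom. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the data are invented, and nothing here reproduces the figures of the Wisconsin study.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two lists of equal length, at least 3 cases")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_zero_correlation(r, n):
    """t ratio for testing a true correlation of zero, with n - 2 d.f."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Invented ratings of five teachers and invented mean pupil gains.
ratings = [1, 2, 3, 4, 5]
gains = [2, 1, 4, 3, 5]

if __name__ == "__main__":
    r = pearson_r(ratings, gains)
    print(r, t_for_zero_correlation(r, len(ratings)))
```

With small groups of teachers, even a moderately large r yields a t ratio that a table of t will not mark as significant, which is the sense in which such coefficients are consistent with a true relationship of zero.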

A third method has been the attempt to judge the worth of the teacher 
by her product, the performance of her pupils. This is certainly the most 
direct, and is often asserted to be the only valid, approach. The most 
obvious way to achieve this result is to measure the improvement made 
by the pupils during a period of instruction under the teacher. But the 
problem is far more complicated than it at first appears. Even when allow- 
ances are made for differences in the intelligence and initial achievement 
of the pupils, the greater problem remains of determining how much of the 
growth is due to natural maturity and how much to the total educational 
environment in school and out of school, and the still greater problem of 
knowing how much of this improvement is due to the influence of any par- 
ticular teacher. Most competent observers today would agree with Trax-
ler that “the use of test results for rating teachers is seldom advisable.”¹⁵
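
The allowance for initial achievement mentioned above is commonly made by predicting each pupil's final score from his initial score and averaging, for each teacher, the amounts by which her pupils exceed or fall short of prediction. The Python sketch below shows only that arithmetic. It is an assumption-laden illustration (the function name, the single straight-line prediction fitted to all pupils, and the scores are invented here), and it does nothing to separate the teacher's influence from maturation or from the rest of the educational environment.

```python
from collections import defaultdict


def mean_residual_gain(records):
    """Average residual (final minus predicted final) for each teacher.

    `records` is a list of (teacher, initial_score, final_score) tuples.
    One least-squares line, final = intercept + slope * initial, is fitted
    to all pupils together; this pooling is an assumption of the sketch.
    """
    n = len(records)
    mean_x = sum(r[1] for r in records) / n
    mean_y = sum(r[2] for r in records) / n
    sxx = sum((r[1] - mean_x) ** 2 for r in records)
    sxy = sum((r[1] - mean_x) * (r[2] - mean_y) for r in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = defaultdict(list)
    for teacher, initial, final in records:
        residuals[teacher].append(final - (intercept + slope * initial))
    return {t: sum(v) / len(v) for t, v in residuals.items()}


# Invented November and May scores for pupils of two teachers.
sample = [
    ("A", 10, 20), ("A", 20, 30), ("A", 30, 40),
    ("B", 10, 14), ("B", 20, 24), ("B", 30, 34),
]

if __name__ == "__main__":
    print(mean_residual_gain(sample))
```

A positive average means only that the teacher's pupils scored above the prediction from their initial standing; the text's caution about attributing that difference to the teacher still applies.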

A summary by Barr¹⁶ notes encouraging progress but emphasizes that
the road ahead is long and difficult: 


The influence of any particular teacher is deeply enmeshed in a host of other 
school, pupil, and community factors. While very definite progress has been made 
in this area, it is not easy to isolate the effects of particular teachers in particular 
situations. There is reason to be optimistic about the use of more precise instru- 
ments of measurement in the management of the teaching personnel, but for the 
time being, discretion is the best part of valor. 


The Cooperative Study. One of the most ambitious attempts at evalu- 
ation by means of standard tests has been the Cooperative Study of 
Secondary School Standards, involving 198 schools and a total of over 
300,000 tests.¹⁷ In spite of unusual care to avoid the difficulties summarized
in Table 39, the Cooperative Study concluded that since the results showed 
that better methods of evaluation were available for accreditation, the use 
of standard tests should be restricted to diagnostic and guidance purposes 
by the local school. The Cooperative Study also attempted to judge the 
product of the school by follow-up studies of the subsequent careers and 


¹⁴ Helen M. Walker (Editor), The Measurement of Teaching Efficiency, pages 73-141.
New York: The Macmillan Company, 1935.

¹⁵ Arthur E. Traxler, Techniques of Guidance, page 186. New York: Harper & Brothers,
1945.

¹⁶ A. S. Barr, “The Use of Measurement in the Management of Teacher Personnel,”
Education, 66: 431-435, March, 1946.

¹⁷ Evaluation of Secondary Schools, General Report, op. cit., Chapter VIII.


TABLE 39

USE OF TESTS IN EVALUATING SCHOOLS¹⁸

1. Criticism: The ability of students varies widely in different schools.
   Comment: […] which the achievement can be compared with others of the
   same level of scholastic ability, and not with the usual national norm.

2. Criticism: A general testing program must be uniform and inflexible and
   does not recognize individual differences in schools.
   Comment: Difficult to meet this objection. However, there is a large body
   of instructional material common to all institutions. This common core can
   be used as a basis of comparison by testing, leaving the differentiating
   phases to other types of evaluation.

3. Criticism: A general testing program tends to crystallize curricula and to
   reduce instruction to mere coaching for examinations.
   Comment: Answer is yes, if tests are used for this purpose at regular
   intervals announced in advance. Does not hold for occasional testing to be
   used with […]. No objection to use of tests by schools for
   self-examination.

4. Criticism: Available tests do not adequately measure important outcomes
   of instruction.
   Comment: Undoubtedly this is less true than formerly. Moreover, many
   important outcomes are measured by the better tests, and others are
   difficult to evaluate by any techniques so far developed.

5. Criticism: The achievement of students at any given time is the result of
   all previous schooling, not merely that of the present school.
   Comment: Can be met by using equivalent tests before and after a period of
   instruction and by judging the worth of the school in terms of the gains
   effected between tests.

6. Criticism: The average achievement of an institution as a whole does not
   properly take into account differences within the institution in curricula
   and departments.
   Comment: A valid objection to a single criterion used for evaluating a
   school, and no more true of tests than of any other measure. Apparently the
   only answer is to present a picture of the profile of the institution
   showing the strong and weak points.

7. Criticism: Measuring the status of the pupils at any given time tends to
   reflect the quality of the school at some time in the past, whereas what is
   wanted is a view of the school as it is now functioning.
   Comment: Can be met at least in part by measuring the growth produced
   during an instructional period in the school being evaluated.

8. Criticism: Some measurable outcomes may be due to out-of-school contacts
   and so cannot properly be attributed to the influence of the school.
   Comment: The objection probably holds mainly to the field of the social
   studies. In a sense, the degree to which this occurs is a measure of the
   success of the school, which should attempt to utilize and co-ordinate all
   influences […] that make up the educational environment, out of school as
   well as in school.

9. Criticism: It is difficult to obtain standard conditions for administering
   a test to a variety of schools scattered over a wide area.
   Comment: The difficulty can be reduced to a minimum by using a small staff
   of carefully trained examiners who follow a simple program fully worked out
   in advance.

¹⁸ Adapted largely from Evaluation of Secondary Schools, General Report, Chapter VIII.




success, both academic and nonacademic, of former pupils, and concluded 
that this method was mainly of value for local school use. A periodic can-
vass of the opinion of pupils about the instruction and other aspects of the
school which they are attending is also a valuable means of self-analysis 
and guidance; for, although the customer may not always be right, what 
he thinks about the institution is important, even when he is mistaken. 
In the Cooperative Study, pupil judgment also proved to be about as 
useful for evaluating schools as the elaborate testing program. 


B. General Principles of Evaluation 


For elementary schools. The Research Division of the National Edu- 
cation Association formulated the following statements of guiding prin- 
ciples for evaluating the programs of elementary schools; most of these 
appear equally applicable to the other levels of education:¹⁹


1. Adequate appraisal of the school includes more than the usual program of 
achievement testing. 

2. School appraisal should be diagnostic; that is, it should reveal the specific 
points of strength and of weakness in the school program.

3. Every aspect of the school program should be appraised, regardless of its 
relative difficulty. 

4. Principals and teachers should play important parts in the appraisal of 
their own schools. Their responsibility for planning and initiating appraisal meas- 
ures will vary according to the plan of organization and administration in the 
school system as a whole. 

5. Within reasonable limits and under proper safeguards, pupils also may con- 
tribute to school appraisal. 

6. Evaluation of the school program should be carried on continuously. Perti- 
nent information should be collected thruout each year and summarized at least 
once a year. 

7. Methods of appraisal should be selected on the basis of their reliability, 
practicability, and appropriateness in the particular situation to be appraised. 
A combination of several methods is often better than one alone. 

8. Before undertaking an appraisal, principals and teachers should find out 
how competent workers elsewhere have evaluated similar elements of the school 
program.

9. Careful subjective judgments formed in the light of valid criteria are better 
than conclusions based on objective data from a poorly planned or carelessly 
executed experiment.

10. Every appraisal should be made with reference to specified criteria of some 
kind. Such criteria should themselves be carefully evaluated before they are used. 

11. Of the several types of criteria which may be used, those concerned with 
pupil development should receive first consideration. 

12. There should be close agreement between the accepted objectives of a school 
and the instruments which the school uses to measure its attainment of these
objectives. 

13. When it is impracticable to determine the merits of local school practices
directly, these practices should be appraised with reference to the findings of
available research studies and expert opinion outside the school. 


¹⁹ The National Elementary Principal, op. cit., 16: 237-238, July, 1937.




14. Statistical technics for determining the reliability of experimental results 
should be used only with a thoro understanding of their purpose and significance. 

15. The results of appraisal should be used to improve the school program. It 
is — that classroom teachers, as well as principals, be fully informed of these 
results. 

16. Parents and pupils also should be given accurate information concerning 
the strong and weak elements of the school program, so that they may help to 
improve it. 


For secondary schools. The Cooperative Study of Secondary School 
Standards has prepared the following eighteen principles, which provide a 
comprehensive philosophy not only for evaluating secondary schools, but 
also for evaluating other levels of education:²⁰


1. American secondary schools, much as they may differ in details, are essen- 
tially alike in their underlying purposes and organization. 

2. In a democracy the fundamental doctrine of individual differences is as 
valid for schools as for individuals. Schools, as well as pupils, differ from each 
other markedly. 

3. A school can be studied satisfactorily and judged fairly only in terms of its 
own philosophy of education, its individually expressed purposes and objectives, 
the nature of the pupils with whom it has to deal, the needs of the community 
which it serves, and the nature of the American democracy of which it is a part. 
All American schools, however they may differ in type, have this in common: 
they are instrumentalities for transmitting our American heritage and our Ameri- 
can democratic ideals. Provided this aim can be clearly kept in view in every case, 
each school is free to determine its own educational policies in promoting the ideals 
of American civilization.

4. A school should be judged in terms of the extent to which it meets satis- 
factorily the needs of all pupils who should come to it, not alone of those who 
continue their formal education in institutions of higher learning.

5. Methods of accreditation and interpretation of evaluation should recognize 
the differences in background, development, and existing conditions in different 
states and regions. No attempt should be made to develop uniform standards for 
the nation or to have them administered from a single national office.

6. It is more significant to measure what the school does than what it is or
what it has. The educational process and product are more important to evaluate
than the machinery and equipment.

7. A school should be judged as a whole, not merely as the sum of its separate 
parts. 

8. The number of factors evaluated in the modern secondary school should be 
sufficiently large and varied to give valid evidence of the worth of the school in each 
of its main areas.

9. Accrediting criteria and procedures should be brief enough in extent, suffi-
ciently varied in form, and convenient enough in arrangement to be practicable
for use in all secondary schools.

10. Methods of evaluation and accreditation, as far as possible, should be based
upon scientific studies and objective evidence, rather than upon untested assump-
tions and unsupported opinions.


²⁰ Evaluation of Secondary Schools, General Report, op. cit., pages 57-61; also How to
Evaluate a Secondary School (1940 Edition), op. cit., pages 17-21.




11. The considered judgment of competent educators is an essential factor in the
evaluation of the quality and character of the work of a school. 

12. A valid method of evaluation and accreditation, based tentatively upon
existing research studies and expert judgment, should be fully tested by extensive 
experimental try-out in a large group of typical, representative secondary schools 
throughout the country. The results of this experimentation should be carefully 
analyzed and evaluated. 

13. While it is desirable in many respects that definite standards or levels of 
achievement should be developed, it is recognized that in most of the important 
aspects of a school’s work the best available basis for the development of useful 
standards will probably be comparison with the practices in other comparable 
schools. 

14. A good school is a growing school. It should be judged by its progress between 
two different dates as well as by its status at a single date. 

15. Any useful, stimulating, and valid method of accreditation should be flexible 
with the passage of time; that is, it should be capable of reasonable modification 
as new bases of evaluation and different levels of achievement are suggested or 
developed from the use of existing ones. 

16. If criteria for evaluation are sufficiently flexible, extensive, and thorough, it 
is not essential that they be applied annually. 

17. The bases and methods of evaluation should be such as to require active 
participation in the process on the part of the entire professional and non-profes- 
sional staffs of the school. 

18. An important function of a national, regional, or state agency should be 
stimulation toward continuous growth and improvement, not merely inspection 
and admission to membership. 


For higher institutions. The Committee on Revision of Standards 
created by the Commission on Higher Institutions of the North Central 
Association of Colleges and Secondary Schools spent five years and 
$135,000 in making a study reported in a series of seven monographs.²¹
Section VI, entitled “Institutional Purposes and Clientele,” and regarded
as “the very heart of the new accrediting policy,” is as follows:²²


Recognition will be given to the fact that the purposes of higher education are
varied and that a particular institution may devote itself to a limited group of 
objectives and ignore others, except that no institution will be accredited that 
does not offer minimal facilities for general education, or require the completion of 
general education for admission.

Every institution that applies for accreditment will offer a definition of its
purposes that will include the following items: 

1. A statement of its objectives, if any, in general education.

2. A statement of the occupational objectives, if any, for which it offers training. 

3. A statement of its objectives in individual development of students, including 
health and physical competence. 


This statement of purposes must be accompanied by a statement of the institu-
tion's clientele showing the geographical area, the governmental unit, or the re-
ligious group from which it draws students and from which financial support is
derived.

The facilities and activities of an institution will be judged in terms of the pur-
poses it seeks to serve.


²¹ The Evaluation of Higher Institutions, published by University of Chicago Press.

²² George F. Zook and M. E. Haggerty, Principles of Accrediting Higher Institutions,
pages 150-151. Chicago: University of Chicago Press, 1934.


C. Evaluating Various Aspects of the School


The philosophy of the school. There is rather remarkable agreement 
among the foregoing principles for evaluating the three levels of education 
but nowhere is the agreement more notable than upon the point that an 
institution must be appraised in terms of its own philosophy and objectives, 
The Cooperative Study, for example, recommends that “а secondary 
school be studied expressly in terms of its own philosophy of education, 
its individually stated purposes and objectives, the nature of the pupils 
with whom it has to deal, the needs of the community which it serves, and 


И. Philosophy of Secondary Education 
А. Эсхлетсамт Ponvts or View 


The material which follows is desi to secure the viewpoint of the school concetning various 
philosophy. There is no implication ЕН is the “right” one. Preferably only one item 
each group—the one with which your school is in closest agreement as a matter of fundamental belief, regardless of actual 
practice. Write any modification or qualification in the space provided, if you feel it necessary. 


Fundamental Concepts 


4. The type of political organization most desirable for 
society is one in which— 
{ ) а. The determination of policies is entrusted to 
cially trained personnel chosen by general elec- 


tion 

( ) b. Policiesaredetermined by individuals selected i 
an electordte which is restricted on the 
of racial or economic status 

( ) е. All individuals share in the determination of 

olicies in proportion to their abilities 

()da all individuals have equal voice in the deter- 
mination of policies 

( ) е. Individuals are completely subordinated to su- 

thority, and policies are determined by a mi- 

nority group 


2. The economic organization most desirable is one in which 
( ) в. Individuals may retain the results of production 
on the assumption that public welfare will be 
benefited by thelr philanthropies | 
( ) b. No restrictions are placed upon the right of an 
individual to amass weal 
{ ) c. Individuals may obtain wealth but are restricted 
by requirements of conservation of natural re- 


sources 
{ ) d. All share equally in the products of labor and 
industry м z 
4.) е Private enterprise is encouraged but with restric- 
tions assuring the conservation of natural re- 
sourcés and with provisions for the distribution 
of a considerable portion of the results of pro- 
duction in the interests of the workers and of 
the general public 


3. The social organization most desirable is one in which:
( ) a. There are groups which have special social privileges because of
       hereditary family connections
( ) b. Social position depends on professional, religious, racial, or
       nationality status
( ) c. All individuals have equal social status regardless of economic,
       cultural, or intellectual qualifications and regardless of race or
       nationality
( ) d. All individuals of the dominant racial or nationality group have
       equal social position regardless of economic, cultural, or
       intellectual qualifications
( ) e. Social position is given to any individual who has achieved special
       distinction in his field

4. In a democracy the school should place most emphasis on helping to
   prepare pupils:
( ) a. To make adjustments to present social and economic conditions
( ) b. To participate in the reconstruction of society
( ) c. To make adjustments to meet changing conditions
Qualifications:

5. In a democracy free secondary education should be provided for:
( ) a. All adolescents who are not mentally or physically defective to such
       an extent that they cannot be educated with normal children
( ) b. Only those adolescents of superior intellectual ability
( ) c. Those adolescents who can profit by a college preparatory, cultural,
       disciplinary program
( ) d. Only those adolescents of superior social or economic status
( ) e. All adolescents
Qualifications:

Figure 46. A Suggested Technique for Evaluating the Philosophy of a Secondary
School. (From Evaluative Criteria, 1940 Edition, Cooperative Study of Secondary
School Standards, Washington, page 8.)




the nature of the American democracy of which it is a part."²³ It recognizes
four distinct phases in the satisfactory evaluation of a secondary school:²⁴


1. Statement by the school of its philosophy of secondary education and of its 
objectives. 

2. Checking and validation of the statements of philosophy and objectives 
against the needs of the pupil population and community which the school serves. 

3. Revision or modification of the statements of philosophy and objectives, if
necessary, in light of step number 2 above.

4. Evaluation of all aspects of the school in terms of these revised statements
of philosophy and objectives. This phase involves the use of the rest of the
Evaluative Criteria.


Figure 46 illustrates one of the procedures suggested for formulating the 
school's philosophy. Step 2 above indicates briefly the procedure to be fol- 
lowed in evaluating this philosophy, which is largely a matter of checking 
it for clearness, for internal consistency, and for appropriateness to the 
community to be served. Regarding the pupils and the community, the 
basic data which are required for the external evaluation are as follows:²⁵


I. Basic data regarding pupils
   A. Graduates and enrollment by grades and by sex
   B. Number of years seniors have been in the school
   C. Distribution of withdrawals according to cause
   D. Age-grade distribution of pupils
   E. Distribution of I.Q.'s by grades
   F. Educational intentions of seniors by sex
   G. Occupational intentions of seniors by sex

II. Basic data regarding the community
   A. Population data for the school community
   B. Occupational status of adults
   C. Occupational status of youth of secondary school age
   D. Educational status of adults
   E. Financial resources of the school district
   F. Agencies affecting education
   G. Additional socio-economic information (seven items)


The educational program. A satisfactory statement of the school's 
philosophy having been formulated, a basis is now available for evaluating 
the educational program and organization of the school. The general point
of view and procedure of the Cooperative Study are given in the “Instructions”
reproduced in Figure 47.

The following list of educational “temperatures” indicates the compre- 
hensive character of the evaluation of the educational program: 


23 How to Evaluate a Secondary School (1940 Edition), op. cit., page 65. 


24 Evaluative Criteria (1940 Edition), page 6. Washington, D. C.: Cooperative Study
of Secondary School Standards.


25 Ibid., pages 17-28.




1. Curriculum and Course of Study.
General principles; curriculum development; amount of offerings; English;
ancient languages; modern languages; mathematics; sciences; social studies;
music; arts and crafts; industrial arts; homemaking; agriculture; business
education; health and physical education for vocational shop; general
evaluation.


2. Pupil Activity Program.
Nature and organization; school government; home rooms; school assembly;
school publications; music activities; dramatics and speech; social life;
physical activities of boys; physical activities of girls; school clubs; finances;
general evaluation.


Instructions

GENERAL

In checking and evaluating the various features included in this [illegible], the
underlying philosophy and expressed purposes and objectives of the school and the
nature of the pupil population and community which it serves (as outlined in
Sections B and C) should be kept constantly in mind. Evaluations are to be made in
the light of these factors. Persons making evaluations should continually ask: “Do
the practices in the school evaluated accord with the philosophy and objectives of
the school and meet the needs of its pupil population and community as well as do
the practices of other schools?” They should not consider the size, type, or
location of the school, the financial support available, state requirements, or
other local factors, except in so far as these factors may have a legitimate effect
on the philosophy and objectives of the school or on the needs of the community. In
later interpretation of the results of evaluations suitable allowance may be made
for any of these factors, but at the time of evaluation an attempt should be made
to evaluate the actual program of the school regardless of necessary limitations.

The two-fold nature of the work (evaluation and stimulation to improvement) should
also be kept constantly in mind. Careful, discriminating judgment is essential if
these purposes are to be satisfactorily served. While the attainment of a score may
be desirable, it is of secondary importance. It should not be permitted to interfere
with accurate evaluation; otherwise, real improvement cannot be undertaken and
attained.

Those making evaluations should be constantly on guard against the common tendency
to choose the higher of two possible evaluations when in doubt. Unless a superior
evaluation is definitely indicated and justified by available evidence, one of
average or below average should be made.

CHECKLISTS

The checklists consist of provisions, conditions or characteristics found in good
secondary schools. Not all of them are necessary, or even desirable, in every good
school. Nor do these lists [illegible] items listed but have other compensating
features.

The use of the checklists requires four symbols. (1) If the provision or provisions
called for in a given item of the checklist are definitely made or if the conditions
indicated are present to a very satisfactory degree, mark the item, in the
parenthesis preceding it, with the symbol (+); (2) if the provision is only fairly
well made or the conditions are only fairly well met, mark the item with the symbol
(-); (3) if the provisions or conditions are needed but are not made, or are very
poorly made, or are not present to any significant degree, mark the item with the
symbol (0); (4) if it is unnecessary or unwise for the school to have or to supply
what specific items call for, mark such items with the symbol (N). (Note: These are
to be regarded merely as convenient symbols, not mathematical terms.) In
[illegible], mark items:

+  condition or provision is present or made to a very satisfactory degree
-  condition or provision is present to some extent or only fairly well made
0  condition or provision is not present or is not satisfactory
N  condition or provision does not apply

Space is provided at the end of each checklist for writing in additional items.

EVALUATIONS

Evaluations are to be made, wherever called for, on the basis of personal
observation and judgment, [illegible] the checklist as marked [illegible] other
available [illegible].

[The definitions of the numerical ratings are largely illegible in this copy. The
legible fragments define each rating by the extent to which the provisions or
conditions are present and functioning, relative to regionally accredited schools;
the rating 2 is labeled "Inferior"; and N means "does not apply," with an
explanation to be given whenever it is used.]

Comments: [illegible] of compensating features or particular shortcomings,
explanations, justifications of evaluations, or other pertinent matters.

Figure 47. Instructions for Using the Evaluative Criteria Developed by the
Cooperative Study of Secondary School Standards. (From Evaluative Criteria, 1940
Edition, Cooperative Study of Secondary School Standards, Washington, page 30.)




3. Library Service. 
Library staff; organization and administration; book collection, number of 
titles; book collection, recency; book collection, general adequacy; periodicals; 
supplementary materials; selection of materials; teachers and the library; use 
by pupils; general evaluation. 


4. Guidance Service. 
Nature and organization; guidance staff; information about pupils; guidance 
procedures; phases of guidance; results; general evaluation. 


5. Instruction. 
Classroom activities; use of community; textbooks; methods of appraisal; 
special committee judgment; general evaluation. 


[Chart: "Summary of Evaluative Criteria," with one thermometer column per area
(legible labels include Curriculum, Library, Guidance, Instruction, and Outcomes
under the heading "Educational Program") and a vertical scale with bands labeled
Very Superior, Superior, Average, and Inferior.]

Figure 48. Summary of Evaluative Criteria for the Median Secondary School.
(From How to Evaluate a Secondary School, 1940 Edition, Cooperative Study of
Secondary School Standards, Washington, page 97.)




6. Outcomes.
Evaluation procedures; attainment in the principal subject matter fields;
attitudes and appreciations.


Figure 48 illustrates the graphical summary of these "temperatures" for 
the median school. It will be noted that this school, which happens to be 
a large public school, is rated “average” (between the 30th and 70th
percentiles) in four of the six areas of the educational program. The school is
only at the 11th percentile, or "inferior," in the curriculum, however, and 
at the 80th percentile, or “superior,” in library. The graphical device em- 
ployed makes it possible to see at a glance the strong and weak points of 
the program. 
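
The reading of the chart can be sketched in a few lines of code. This is only an illustration: the text fixes "average" as the band between the 30th and 70th percentiles and places the 11th percentile in "inferior" and the 80th in "superior," but the cut points at the 10th and 90th percentiles, the treatment of values that fall exactly on a boundary, and the function name `rating_band` are assumptions made for the example.

```python
def rating_band(percentile):
    """Return the verbal rating for a percentile standing.

    Assumed bands (only the 30-70 "average" band is stated in the text):
    below 10 very inferior, 10-30 inferior, 30-70 average,
    70-90 superior, above 90 very superior.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie between 0 and 100")
    if percentile < 10:
        return "very inferior"
    if percentile < 30:
        return "inferior"
    if percentile <= 70:
        return "average"
    if percentile <= 90:
        return "superior"
    return "very superior"


# The two exceptional areas of the median school described above.
print(rating_band(11))  # curriculum
print(rating_band(80))  # library
```

Applied to all of the areas at once, such a function gives the same strong-and-weak-points summary that the thermometer chart gives at a glance.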


B. CONTENT OF OFFERINGS

In evaluating this page consider content of subject [illegible] not only for
[illegible] matter and for skills, but also for understanding [illegible]
significance of the content for attitudes, appreciations, and ideals.
A copy of the school's courses of study should be supplied if available; if not, a
brief description or outline for each course should be furnished. If the textbook
serves as the course of study, it should be evaluated below.
Include in the table only those subjects or courses in which a class is [illegible].
If there are subjects or fields which cannot be classified [illegible] the headings
in one of the columns.
Note that the symbol N, "condition or provision does not apply," [illegible] of this
table whenever the subject field should not be expected [illegible] subject field is
not, and should not be, offered in the school.

CHECKLIST
In each major field or area provision is made for
 1. Stating the objectives to be attained
 2. Emphasizing significant contributions of our social heritage to present day
    life values
 3. Promoting pupils' understanding of present day social problems
 4. Stimulating pupils' interests and satisfying their needs
 5. Modifying courses to meet individual differences
 [Items 6, 7, and 9 are illegible in this copy.]
 8. Suggesting methods to be used in attaining objectives
10. Solving appropriate problems requiring [illegible] research procedures

EVALUATION
How well does each course of study provide for applications to out-of-school life?

Comments:

Figure 49. An Evaluative Procedure for the Content of the Offerings in the
Principal Subject-Matter Fields of a Secondary School. (From Evaluative Criteria,
1940 Edition, Cooperative Study of Secondary School Standards, page 35.)




Figure 49 shows the checklist and evaluations proposed for the content 
of the offerings in the principal subject-matter fields. Of these, only the 
two with the double stars at the bottom of the columns are included in the 
short Gamma Scale of 25 thermometers. These two are also included in 
the Beta Scale, together with the three additional fields indicated by single 
stars. The others appear only in the complete Alpha Scale of 110 ther- 
mometers. 

Bruner²⁶ proposed an elaborate set of criteria for judging courses of
study. A gross scale of four points, Excellent, Good, Fair, and Poor, is
provided. The following ten questions, for example, are suggested for judging
the extent to which the course of study is based upon psychological
principles of learning:²⁷

____ 1. Is each new learning act considered to be in some degree remaking the
        whole organism?
____ 2. Is self-activity considered fundamental to learning?
____ 3. Is study conceived of as an attack upon the situation, “and what is
        learned is learned as and because it is needed for the control of this
        situation”?
____ 4. Are provisions made for taking into consideration the underlying principles
        of integration?
____ 5. Are the activities and materials organized into patterns which, if used,
        assist in the better growing of the individual?
____ 6. Is the position held that the learner should experience satisfaction from
        engaging in activities?
____ 7. Is knowledge considered as a means to enable the individual to participate
        more effectively in life situations?
____ 8. Is significance attached to pupil meanings and insights?
____ 9. Is the view held that growth and learning are continuous throughout the
        life of the individual?
____ 10. Is provision made for making the situations of the school real and
        dramatic?


The Cooperative Study suggests a variety of procedures for evaluating 
the library service of the secondary school. On the assumption that the 
library service should be a center of the educational life of the school and 
not merely a collection of books, it is asserted that adequate provisions for 
the school library should include the following:²⁸

(1) a well educated, efficient librarian; (2) books and periodicals to supply the
needs for reference, research, and cultural and inspirational reading; (3) provision
for keeping all materials fully cataloged and well organized; (4) a budget which
provides adequately for the maintenance and improvement of the library;
(5) encouragement of the pupils in the development of the habit of reading and
enjoying books and periodicals of good quality and real value.


26 Herbert B. Bruner, “Criteria for Evaluating Course-of-Study Materials,” Teachers
College Record, 39: 107-120, November, 1937.

27 Ibid., page 111.

28 Evaluative Criteria (1940 Edition), op. cit., page 51.


Figure 50 illustrates the derivation of three measures of the adequacy 
of the book collection. It will be noted that books of the various classifica- 
tions are weighted unequally in obtaining the composites. The two ex- 
tremes are books on philosophy, with a weight of 1, and books on history, 
travel, and biography, with a weight of 20. 

Figure 51 shows the section on Teachers and Libraries and illustrates a 
different technique. This section seeks answers to two important questions: 
First, how extensively do the teachers make personal use of the resources 
of the library in promoting their own professional growth and in their 


II. Adequacy of Library Materials
A. BOOK COLLECTION

[Worksheet tabulating the book collection by classification, with unequal weights,
to obtain three composite measures; the entries are illegible in this copy.]

Figure 50. The Computation of Three Measures of the Adequacy of the Book
Collection in the Library of a Secondary School. (From How to Evaluate a Secondary
School, 1940 Edition, Cooperative Study of Secondary School Standards, Washington,
page 77.)




classroom planning and teaching? Second, how effectively do the teachers 
stimulate pupils to use the library materials? 
The Cooperative Study recognized five areas of guidance responsibility 


V. Teachers and Libraries

A. PERSONAL USE

CHECKLIST
( ) 1. Teachers use school and public libraries extensively to promote their own
       personal and professional growth
( ) 2. Teachers and supervisors use the library as a stimulus to curriculum
       development and enrichment
( ) 3. Teachers keep the librarian informed regarding prospective classroom
       demands on the library and librarian
( ) 4. Teachers use the library extensively in their classroom planning and
       teaching

EVALUATIONS
( ) y. How extensively do teachers use libraries in classroom planning?
( ) z. How extensively do teachers use libraries for their leisure reading?

Comments:

B. STIMULATION OF PUPIL USE

CHECKLIST
( ) 1. Teachers stimulate pupils to use the library, individually or in groups,
       to find and organize materials on selected subjects or class projects
( ) 2. Teachers help pupils in the effective use of the library, largely by means
       of library references needed in their classroom projects
( ) 3. Teachers encourage pupils to use the library for recreational and leisure
       reading
( ) 4. Teachers, with the help of the librarian, use the library as a means of
       cultivating good study and learning habits in pupils
( ) 5. Teachers and classes borrow books and other library materials for use in
       the classroom
( ) 6. Each teacher keeps a record of the voluntary reading done by the pupils in
       his own field
( ) 7.

EVALUATION
( ) z. How effectively do teachers stimulate pupils to use library materials?

Comments:

VI. Use of Libraries by Pupils

CHECKLIST
( ) 1. Selected pupils act as assistants in the library as a means of education
       and exploration in library work. (The time and effort of such pupils are
       never exploited)
( ) 2. Pupils, individually and in groups, commonly find the library a profitable
       center for classroom preparation
( ) 3. Pupils use libraries extensively for leisure reading and for developing
       other leisure interests
( ) 4. Pupils help collect useful vertical file material for the library
( ) 5. Pupil activity organizations use the library extensively in the promotion
       of their projects
( ) 6. Pupils are learning to respect public property and to help care for it
( ) 7. Pupils are learning to respect the rights of others, in the library and in
       the use of its materials
( ) 8. Pupils are learning to use other libraries in the community
( ) 9. Pupils use the dormitory reading room if available
( ) 10.

SUPPLEMENTARY DATA
1. Average number of school library books circulated to pupils per month
2. Average number of different pupils to whom school library books circulate per
   month
3. Number of high school pupils holding public library cards

EVALUATIONS
( ) x. How extensively do pupils use library books?
( ) y. How extensively do pupils use periodicals?
( ) z. How extensively do pupils use supplementary materials?

Comments:

Figure 51. Evaluative Techniques for the Library Service of a Secondary School.
(From Evaluative Criteria, 1940 Edition, Cooperative Study of Secondary School
Standards, Washington, page 59.)


in the secondary school. These are regarded not as distinct types of guid- 
ance but rather as phases of an interrelated unitary process. These phases, 
together with the number of items in the checklists and the evaluations 
sought, are, in summary, as follows: 


A. Educational Guidance .......................................... 28 items
   1. Articulation with lower schools
      How effective are procedures for articulation with lower schools?
   2. Curricular and school guidance
      How adequately is guidance provided in such matters as planning a
      sequence of studies, remedying study difficulties, etc.?
   3. Guidance concerning the post-secondary school
      How adequate are provisions for assisting pupils in choices involving the
      post-secondary school?

B. Vocational Guidance and Placement ............................. 14 items
   How adequate are provisions for assisting pupils to make wise vocational
   choices? How adequate are provisions for placement and follow-up service?

C. Guidance in Use of Leisure Time ................................ 6 items
   How adequately are pupils assisted in making wise choices of leisure
   activities?

D. Social and Civic Guidance ...................................... 8 items
   How adequately are pupils assisted in making wise choices in matters
   involving social and civic relationships?

E. Personal Guidance .............................................. 7 items
   How adequately are pupils assisted in making wise choices in personal
   matters?


[Worksheet: V. Guidance Service, Computation of Summary Score. Evaluations for
each part of the guidance service (legible labels include Nature and organization,
Information about pupils, Phases of guidance, Results, and General evaluation) are
entered and averaged, converted to percentiles, and weighted; a summary conversion
table stands at the right. The entries are illegible in this copy.]

Figure 52. The Computation of the Summary Score for the Guidance Service of a
Secondary School. (From How to Evaluate a Secondary School, 1940 Edition,
Cooperative Study of Secondary School Standards.)


Figure 52 illustrates the computation of the summary score for the guid- 
ance service of the school when the Alpha Scale is used. It will be seen that 
the various evaluations are entered in spaces provided and then averaged. 
These point scores are next expressed in percentiles by the use of the stand- 
ard conversion table. These percentiles are then weighted to obtain a sum- 
mary score. The equivalent percentile is found from the summary conver- 
sion table at the right of the figure. The arrows indicate the sequence of 
events in the use of the tables. Similar conversion tables are used for the 
other phases of evaluation. 
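
The sequence just described (average the evaluations, convert to percentiles, weight, and combine) can be sketched in code. This is a rough illustration and not the Cooperative Study's own worksheet: the conversion table, the weights, and the sample evaluations below are invented for the example, the use of linear interpolation between table rows is an assumption, and the function names are hypothetical. The final step in the text, a second lookup in the summary conversion table, would be another call of the same kind and is omitted here.

```python
from bisect import bisect_right

# HYPOTHETICAL conversion table: (average point score, percentile).
# The real table is printed on the Cooperative Study's forms.
POINT_TO_PERCENTILE = [(1.0, 2), (2.0, 15), (3.0, 50), (4.0, 85), (5.0, 98)]


def average(evaluations):
    """Average the numerical evaluations entered for one part of the service."""
    return sum(evaluations) / len(evaluations)


def to_percentile(point_score, table=POINT_TO_PERCENTILE):
    """Convert an average point score to a percentile.

    Interpolates linearly between table rows (an assumption) and clamps
    scores that fall outside the table to its first or last row.
    """
    points = [p for p, _ in table]
    if point_score <= points[0]:
        return table[0][1]
    if point_score >= points[-1]:
        return table[-1][1]
    i = bisect_right(points, point_score)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (point_score - x0) / (x1 - x0)


def summary_score(parts, weights):
    """Weighted mean of the percentiles of the parts."""
    total_weight = sum(weights[name] for name in parts)
    weighted = sum(weights[name] * to_percentile(average(evals))
                   for name, evals in parts.items())
    return weighted / total_weight


# Invented evaluations and weights for three parts of the guidance service.
parts = {
    "Nature and organization": [3, 4, 3],
    "Information about pupils": [4, 4],
    "Results": [2, 3],
}
weights = {"Nature and organization": 1,
           "Information about pupils": 2,
           "Results": 3}
print(round(summary_score(parts, weights), 1))
```

Keeping the conversion table as data rather than as a formula mirrors the printed forms, in which a different table can be substituted for each phase of the evaluation.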

The quality of instruction in the school is judged by having the work 
of each member of the teaching staff considered from the following points 
of view:²⁹

A. Classroom Activities. 

1. The teacher’s plans and activities. 
2. Cooperation between pupils and teachers. 

B. Use of Community and Environment. 

C. Textbooks and Other Instructional Materials. 

1. Textbooks.
2. Other instructional materials.
D. Methods of Appraisal. 
E. Special Committee Judgment. 


Figure 53 reproduces the last pages of this evaluation and illustrates the 
procedure. 

The philosophy underlying the evaluation of the outcomes of the educational
program is clearly stated in the following guiding principles:³⁰


29 Evaluative Criteria (1940 Edition), op. cit., page 160.
30 Ibid., page 83.




In the educational program of a good secondary school, major concern should be 
given to attaining desirable outcomes and to the various kinds of evidence indicating 
that such outcomes are being realized. It may be necessary to test some outcomes
by departments or in class groups. This, however, should not be construed as
limiting the responsibilities of all phases of the educational program, including the
instructional activities of teachers, pupil activity program, guidance service, library
service, school plant, and school administration, for the attainment of desirable
outcomes. There should be evidence that teachers and pupils are happily and
harmoniously cooperating in the stimulation of a wholesome curiosity about them- 
selves and their environment. Evidence should be sought to show that pupils are 
securing knowledge and developing worthwhile skills, attitudes, tastes, apprecia- 
tions, and habits. There should be evidence that pupils are able to make desirable 
choices or to exercise good judgment in the selection of friends, vocations, leisure 
activities, goods and services, and in other important matters which confront 
youth today. Evaluation of such activities involves more than determining the 
amount of knowledge possessed, measuring the degree of skill, and testing the scope 
of understanding, important and necessary as all these are. Among others, in- 
tangible qualities such as cooperativeness, tolerance, open-mindedness, reverence, 
respect for law, and self-reliance are highly desirable outcomes. Evaluation of such 
outcomes is by no means easy; for most of them there is no standard measure and 


D. METHODS OF APPRAISAL

CHECKLIST
( ) 1. The teacher understands the use, the advantages, and the [illegible] of
       various types of tests and uses them [illegible]
( ) 2. The complete testing [illegible] for many short tests [illegible] few
       relatively long ones
( ) 3. Standardized [illegible] as well as tests of the teacher's own [illegible]
( ) 4. Tests formulated by the teacher are so planned that they [illegible]
( ) 5. Testing and measuring is an integral part of the teaching and learning
       program rather than an activity set apart for certain days
( ) 6. The testing and [illegible] pupil progress rather than [illegible]
( ) 7. The teacher uses tests to [illegible] evaluate progress [illegible] of
       desirable habits, skills, and knowledge
( ) 8. The teacher uses tests to stimulate and evaluate pupils' understanding and
       ability to make applications of knowledge
( ) 9. The teacher uses tests to stimulate and evaluate [illegible], attitudes,
       and ideals
( ) 10. Pupils use tests to evaluate their own progress [illegible]
( ) 11. Diagnostic testing is a regular part of the [illegible]
( ) 12. Other methods of appraisal, such as [illegible]
( ) 13. Results of tests are made [illegible] further instruction
( ) 14.

EVALUATIONS
( ) y. How well are methods of appraisal [illegible] intended?
( ) z. How well do pupils use methods of appraisal to measure [illegible]?

Comments:

E. SPECIAL COMMITTEE JUDGMENT

This evaluation is to be made by the visiting committee after actual classroom
visitation of the teacher.

EVALUATION
( ) z. How satisfactory is the instructional work carried on by this teacher?

Comments:

Figure 53. An Evaluative Technique for the Quality of Instruction in a Secondary
School. (From Evaluative Criteria, 1940 Edition, Cooperative Study of Secondary
School Standards, Washington, page 160.)




therefore evaluation of them necessarily will be largely a matter of judgment. The 
difficulty of the task is no reason for avoiding it, and the importance and uni- 
versality of the problems involved make it imperative that attention should be 
directed to the attainment of such outcomes and to their proper evaluation. 


Another useful instrument “designed to serve as a basis for the appraisal 
of individual school systems with respect to their adaptation to current 
educational needs" has been prepared by Mort and Cornell.³¹ It covers
much the same scope as the Cooperative Study, but the technique is dif- 
ferent. Specific questions are raised, to be answered Yes or No, with places 
for the supporting data at the left of the page. The scores for each section 
are then entered on a special score sheet, and by a simple process of weight- 
ing are combined into a single score. Table 43 gives the summary of this 
score sheet. Tentative norms are available for school systems of various 
sizes located in four states. The first of the ten parts of the section, which 
attempts to determine the degree to which the educational program recognizes
the nature and extent of individual differences in pupils, is as follows:³²


a. Intelligence Tests. Group and individual intelligence tests should be used as
one of the means of analyzing problems of maladjustment.

Q. How many of your pupils have been given intelligence tests?

Interview: Principal, Guidance director, Psychologist (if any).
Observe: Individual records, Test records.

1. Individual intelligence tests have been given in special cases.
                                                                 Yes... No...
2. A group intelligence test is given to all elementary and first year high
   school pupils at least once in three years.                   Yes... No...
3. Intelligence test results, for tests given both in elementary and in high
   school, are made a part of the permanent record of the child. Yes... No...
4. Educational and intelligence test results (group and individual) constitute
   one of the elements upon which guidance is based.             Yes... No...


31 Paul R. Mort and Francis G. Cornell, A Guide for Self-Appraisal of School Systems.
59 pages. New York: Bureau of Publications, Teachers College, Columbia University,
1937.

32 Ibid., page 26.




The school organization and plant. Numerous checklists and score
cards have been prepared for rating school buildings, equipment, and
administrative procedures. In this field, the pioneer work of George D.
Strayer, N. L. Engelhardt, and their students³³ is especially notable. Two
developments will be described briefly. The scope of the evaluation
procedures devised by the Cooperative Study is apparent from the following
outlines of evaluative criteria:


A. School Staff:
   1. Numerical adequacy.
   2. Professional staff: selection, qualifications, improvement.
   3. Nonprofessional staffs: qualifications, improvement in and conditions of
      service.
   4. Special characteristics of the school staff.
   5. General evaluation.

B. School Plant:
   1. The site: health and safety, economy and efficiency, influence on the
      educational program.
   2. The building: health and safety, economy and efficiency, influence on the
      educational program.
   3. Equipment: health and safety, economy and efficiency, influence on the
      educational program.
   4. Special services: cafeterias, clinics, etc.
   5. Special characteristics of the school plant.
   6. General evaluation.

C. School Administration:
   1. Administrative staff: numerical adequacy, preparation and qualifications,
      improvement in service.
   2. Organization: board of control, general policies, superintendent of schools,
      principal.
   3. Supervision of instruction: objectives, procedures and activities, principles,
      results.
   4. Supervision of special services.
   5. Business management: general duties and procedure, budget, accounting,
      maintenance and operation.
   6. School and community relations.
   7. Special characteristics of the school administration.
   8. General evaluation.


That the scope of the analysis proposed by Mort and Cornell is somewhat
similar is apparent from Table 43. Short sections relating to the school
site will illustrate the differences in the two techniques.


33 Published by Bureau of Publications, Teachers College, Columbia University,
New York.


TABLE 43

SUMMARY OF THE MORT-CORNELL SCORE SHEET FOR THE SELF-APPRAISAL OF
SCHOOL SYSTEMS

                                                    Adjustments      Maximum
                                                     Possible         Score
Section                                           Section  Total  Section  Total

I. Classroom Instruction:
   A. The curriculum                                         30              210
      1. Flexibility of curriculum                    10              70
      2. Breadth of curriculum                        10              70
      3. Courses of study                             10              70
   B. Pupil activity                                         28              196
      1. Fields of learning                           13              91
      2. Extracurricular activities                    7              49
      3. Instructional materials                       8              56

II. Special Services for Individual Pupils:
   A. Pupil records and attendance                           13              104
      1. Educational accounting                        7              56
      2. Census and attendance                         6              48
   B. Provisions for individual differences                  26              156
      1. Guidance: educational and vocational          7              42
      2. The individual and the educational
         program                                      10              60
      3. Health service                                9              54

III. Educational Leadership:
   A. Supervision and school organization                    21              105
      1. Professionalization of personnel              8              40
      2. Supervision of instruction                    8              40
      3. Grade and subject organization                5              25
   B. School administration and the community                21              105
      1. Administrative planning                       6              30
      2. Status of control                             7              35
      3. [illegible]                                   8              40

IV. Physical Facilities and Business Management:
   A. The school plant                                       30               90
      1. School plant planning                         5              15
      2. The school site                               5              15
      3. School buildings                             10              30
      4. Special rooms                                10              30
   B. Business management                                    14               42
      1. Supplies and equipment                        7              21
      2. Financial accounting                          7              21

Total                                                       183            1,008




B. ECONOMY AND EFFICIENCY³⁴

CHECKLIST

( ) 1. The site is readily accessible to the school population.
( ) 2. It is accessible over hard surfaced roads and adequate walks.
( ) 3. It is sufficiently extensive for building and play needs, driveways, and
       landscaping.
( ) 4. Play areas are readily accessible.
( ) 5. The site has possibility of future expansion, extension, or adaptation
       without too great cost.
( ) 6. It is as near the center of the school population as environmental
       conditions make advisable.
( ) 7.
( ) 8.

EVALUATIONS

( ) x. How accessible is the site?
( ) y. How extensive is the site?
( ) z. How well adapted is the site for future expansion?

Comments:

THE SCHOOL SITE³⁵

d. Adaptability. Each school site should be laid out and developed in
consideration of both present and estimated future needs. Yes... No...

Q. What is the average size of the elementary school sites? Of high
school sites? How do you justify the size?

Interview: Superintendent.
Observe: Plans and data on future growth, areas of present sites and
enrollment of schools.
Evidence ..........

1. The superintendent has developed plans for the layout of each
   permanent site in terms of estimated changes in enrollment and
   educational program. Yes... No...
2. Present development of sites is such that adjustments can be made with
   minimum of cost to expanding enrollment and program. Yes... No...


An index of variation. N. L. Engelhardt, Jr.,³⁶ has prepared an index
of variation, which is based upon the assumption that any true equalization
of educational opportunities must provide for variability, rather than
uniformity, in the school plant and program. The needed variation must
consider the impact upon the pupil of the physical and social characteristics
of the environment, as well as the personal qualities of the people to be
served. For example, the thirty-eight factors that should be taken into
account in determining the educational program of a community relate to
the age distribution of the population, the health conditions, the housing
conditions, and the social conditions of the community.

34 Evaluative Criteria (1940 Edition), op. cit., page 116.

35 A Guide for Self-Appraisal of School Systems, op. cit., page 49.

36 The Report of A Survey of the Public Schools of Pittsburgh, Pennsylvania, pages
437-444. New York: Bureau of Publications, Teachers College, Columbia University,
1940. Also see The American School and University Yearbook for 1940.

Concluding statement. Evaluation is by no means a new idea in 
education, although the concept has been greatly enlarged in recent years. 
Many new techniques have been devised to supplement those already in 
existence, and in some cases to supplant them altogether. Much remains 
to be done, however. In the meantime educators should acquaint them- 
selves with the uses and limitations of the techniques which have been 
developed. There is no escaping the fact that evaluation is one of the most 
difficult, as well as one of the most important, problems in the modern 
school. The best existing evidence that a school is good is the fact that it 
is continually studying to find ways to improve itself. 


SELECTED REFERENCES FOR FURTHER READING 


Benson, Arthur L., How to Use the Criteria for Evaluating Guidance Programs in
Secondary Schools, Form B. Washington, D. C.: U. S. Office of Education,
March, 1949.

Cornell, Francis G., Lindvall, Carl M., and Saupe, Joe L., “An Exploratory 
Measurement of Schools and Classrooms,” University of Illinois Bulletin, 50: 
1-71, No. 75, June, 1953. 

Dailey, John T., “Development and Application of Tests of Educational Achieve- 
ment Outside the Schools,” Review of Educational Research, 23:102-109, Febru- 
ary, 1953. 

Domas, Simeon J., and Tiedeman, David V., “Teacher Competence: an Annotated
Bibliography,” Journal of Experimental Education, 19: 101-218, December, 1950.

Jahoda, Marie; Deutsch, Morton; and Cook, Stuart W., Research Methods in Social
Relations. New York: Dryden Press, 1951. Chapters 5 and 14, “Data Collection:
Observational Methods,” and “Observational Field-Work Methods.”

Leonard, J. Paul, and Eurich, Alvin C., Editors, An Evaluation of Modern Edu- 
cation. New York: D. Appleton-Century Company, 1942. 299 pages. 

Pace, C. Robert, and Browne, Arthur D., “Trend and Survey Studies," Review of 
Educational Research, 21: 337-349, December, 1951. 

Reavis, W. C., and Cooper, D. H., Evaluation of Teacher Merit in City School 
Systems. Chicago: University of Chicago Press, 1945. 189 pages. 

Sells, Saul B., and Ellis, Robert W., “Observational Procedures Used in Research,” 
Review of Educational Research, 21: 432-449, December, 1951. 

Traxler, Arthur E. (Editor), “Measurement and Evaluation in the Improvement
of Education.” American Council on Education Studies, Series 1, No. 46, Vol. XV,
April, 1951. 141 pages. See especially pages 58-67, “Planning a Comprehensive
Evaluation Program,” by Paul B. Diederich.




Troyer, Maurice E., and Pace, C. Robert, Evaluation in Teacher Education. 
Washington, D. C., American Council on Education, 1944. 308 pages. 
Whitney, Frederick L., The Elements of Research (Third Edition). New York: 


Prentice-Hall, Inc., 1950. ANNE IV, "Representative Federal Surveys of 
Education." 


16 


Public Relations 


A. The Problem 


The following remarks by Robert L. Thorndike, though not pertaining
directly to school situations, might well be applied to them:¹


In personnel selection, as in most fields, there is no lack of polished individuals 
who present in a compelling manner some completely unscientific and unvalidated 
technique. .'. . It is often true, unfortunately, that the best salesmanship is 
applied to the poorest product. The temperament which is disposed to careful and 
exaeting research tends not to take kindly to or have a gift for promotion. But it is 
just this sound scientific worker who must in self-defense develop effective pro- 
motion for his service. The layman does not have the background to discriminate 
between effective personnel research and quackery. He must be educated and 
trained to discriminate between the tested results of a sound personnel system and 
the unfounded claims of the quack. The more scientific and rigorous a personnel 
research worker is, the more important it is for him carefully to consider the public 
relations side of his work. 


Thirty-six years earlier his father and Kandel had quoted a nineteenth 
century educator writing in a similar vein: 


Much of the scepticism prevalent as to the power and value of popular educa- 
tion arises from the inability of the educationist, or of the school teacher, to adduce 
satisfactory statistical evidence of the moral or of the intellectual results from any
special courses of instruction or training, as manifested in after life.²


1 Robert L. Thorndike, Personnel Selection: Test and Measurement Techniques, page
313. New York: John Wiley & Sons, 1949.

2 Edward L. Thorndike and Isaac L. Kandel in “Educational Measurement of Fifty
Years Ago,” Journal of Educational Psychology, 4: 551, November, 1913, from
E. Chadwick’s article in the Museum, a Quarterly Magazine of Education, Literature
and Science, 3: 480-484, 1864.




Seventy-five years after 1864 a survey of parent opinion of public education
in America revealed deep concern over “teacher-made courses of
study” in which the parents have been allowed no voice, as well as concern
over “the stress schools place upon the spectacular in education to the
detriment of programs that will improve the manners and morals of
children.”³ These statements indicate that the public and the school do not
always understand each other, and consequently do not work together to
their mutual advantage.

The meaning of public relations programs. Narrowly conceived, 
the public relations program of the school is synonymous with the publicity 
activities of the school. In recent years, however, the terms publicity and 
propaganda have become so closely associated and so discredited in the 
public mind as to arouse suspicion that something sinister is about to be 
“put over.” Broadly conceived, public relations is merely one important 
aspect of the school’s program of adult education. Its primary aims are 
two: (1) better understanding by the public of the purposes, programs, 
accomplishments, and needs of the school and (2) better understanding 
by the school of the desires and needs of the community as reflected in the
educational views of the public. In other words, its purpose is to effect the 
maximum co-operation between the community's two most important edu- 
cational institutions, the home and the school. And it must be remembered 
always that the child is the connecting link between them. 

A prominent educational leader⁴ includes among the important purposes
of measurement and evaluation in the modern school the provision of
"psychological security to the school staff, to the pupils, and to the
parents," and a "sound basis of public relations." Concerning the latter
Tyler says:

No factor is so important in establishing constructive and co-operative relations 
with the community as an understanding on the part of the community of the 
effectiveness of the school. A careful and comprehensive evaluation should provide 
evidence that can be widely publicized and used to inform the school community 
about the value of the school program. Many of the criticisms of the school ex- 
pressed by the taxpayers and parents can be met and turned into constructive 
ens if concrete evidence is available regarding the accomplishments of the 
school.


There are several reasons for thinking that the problem is becoming
increasingly important and difficult as the years go by. The enlarged
enrollment in the secondary school and the accompanying expansion in the
school program have brought many changes which the public does not
understand. This fact is mainly responsible for the common charge that
the modern school curriculum is cluttered up with all sorts of useless “fads
and frills.” The increasing burden of taxation has naturally made the
citizens critical of all public expenditures. Since in most communities
the public school system is the biggest public business, it is likely to
bear the brunt of the attack. Nor should one overlook the stubborn fact
that the enormous expansion of such enterprises as are provided for by
the social security and old-age pension legislation has greatly increased the
competition for the taxpayer's dollar. In such a situation it is especially
well to keep in mind the wise statement of President Madison: “A popular
government without popular information or the means of acquiring it is
but a prologue to a farce or a tragedy, or perhaps both.”

3 Lester S. Ivins, “What Parents Expect of the School,” Journal of the National
Education Association, 28: 194, October, 1939.

4 Ralph W. Tyler, “The Place of Evaluation in Modern Education,” Elementary
School Journal, 41: 19-27, September, 1940.

The principal sources of “popular information" may be conveniently 
grouped as follows: 


1. Ordinary agencies: local newspapers, student publications. 
2. Official publications: reports, bulletins, handbooks, etc. 

3. Report cards and letters to parents. 

4. Miscellaneous: public programs, exhibits, P.T.A., etc. 


Each of these will now receive brief discussion. 


B. Ordinary Agencies of Public Information 


Local newspapers. As a medium for bringing about desirable relations
between the school and its public, the newspaper ranks high. For most
people it is the principal source of information, but school news as reported
in the local paper is likely to be narrow in scope and lack proportion. An
extensive early study by Farley⁵ revealed that as a rule the patrons
received least information on the school topics in which they were most
interested and most information on school topics in which they were least
interested. Table 41 summarizes the situation.

It may not be surprising but it is certainly unfortunate to find that in
the typical newspaper the total space devoted to the first six items in order
of patrons' interest was less than half that given to extracurricular
activities, which stand at the bottom of the list. Both the school and newspaper
appear to take for granted the excellent work of the classroom, which,
therefore, falls in the dog-bites-man category rather than in the news
classification.⁷ They both apparently forget that a report of the incident is the
most interesting thing in the world to the owner of the dog as well as to
the man who has been bitten. Parents never tire of hearing good reports
of their own children. There is no good reason why the educational side
shows should be allowed to swallow up the main tent. Farley calls attention
to these facts:⁸

5 Belmont Mercer Farley, What to Tell the People about the Public Schools, 136 pages.
New York: Bureau of Publications, Teachers College, Columbia University, 1929.

6 Ibid., adapted from pages 16 and 49.

7 For an instructive discussion of this point, see: Edwin J. Brown, Secondary-School
Administration, pages 270-271. Boston: Houghton Mifflin Company, 1938.

8 Ibid., pages 16, 17.


TABLE 41

THE RANK ORDERS OF THIRTEEN TOPICS OF SCHOOL NEWS ACCORDING TO THE
INTERESTS OF 5,007 SCHOOL PATRONS COMPARED WITH THE SPACE DEVOTED
TO THESE TOPICS BY TEN NEWSPAPERS (AFTER FARLEY)

                                              Rank According to
Topics of School News                 Patrons' Interests   Space in News

Pupil progress and achievement ...........     1                  4
Method of instruction ....................     2                 10
Health of pupils .........................     3                  9
Courses of study .........................     4                  6
Value of education .......................     5                 12
Discipline and behavior of pupils ........     6                 11
Teachers and school officers .............     7                  2
Attendance ...............................     8                 13
Buildings and building program ...........     9                  8
Business management and finance ..........    10                  7
Board of education and administration ....    11                  5
Parent-teacher association ...............    12                  3
Extracurricular activities ...............    13                  1


In other words, patrons wish to know what their children are being taught, how 
they are being taught, what results are being achieved, and how the public schools 
affect the physical welfare of their children. . . . They are ready to listen to the 
educator tell them that the results achieved in the schools are desirable, that they 
are achieved by efficient, scientific methods, that children are taught useful habits 
and skills, that their physical welfare is not neglected. 
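
The disagreement between the two rankings in Table 41 can be summarized in a single figure by the rank-difference method of correlation. The short Python sketch below is an illustration added for the reader and is not part of Farley's study; the names `TABLE_41` and `rank_difference_correlation` are hypothetical, the ranks are simply those transcribed from the table, and the formula used assumes that no ranks are tied.

```python
# Ranks transcribed from Table 41 (after Farley):
# (topic, rank by patrons' interest, rank by newspaper space).
TABLE_41 = [
    ("Pupil progress and achievement", 1, 4),
    ("Method of instruction", 2, 10),
    ("Health of pupils", 3, 9),
    ("Courses of study", 4, 6),
    ("Value of education", 5, 12),
    ("Discipline and behavior of pupils", 6, 11),
    ("Teachers and school officers", 7, 2),
    ("Attendance", 8, 13),
    ("Buildings and building program", 9, 8),
    ("Business management and finance", 10, 7),
    ("Board of education and administration", 11, 5),
    ("Parent-teacher association", 12, 3),
    ("Extracurricular activities", 13, 1),
]


def rank_difference_correlation(rows):
    """Rank-difference (Spearman) coefficient for untied ranks:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rows)
    sum_d2 = sum((interest - space) ** 2 for _, interest, space in rows)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


rho = rank_difference_correlation(TABLE_41)
print(f"rho = {rho:.2f}")  # rho = -0.40
```

A coefficient of about -.40 is consistent with the observation that the topics patrons most wish to read about tend to receive the least newspaper space; with only thirteen topics the figure should be read as descriptive rather than conclusive.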


Student publications. Student publications should occupy a strategic
position in any public relations program. They represent activities that
have educational value in themselves and thus constitute important
exhibits of the actual work of the school. Of these publications the school
newspaper and the yearbook or annual are most important. Since they are
written primarily for the pupils and patrons of the school, these
publications can portray the actual operation of the school program more fully
than the general newspaper, which must appeal to a wider public. What the
student does is always of interest to other students and to parents, but
examination of the student publications of most schools would probably
reveal a very distorted picture of the school situation. As in the regular
newspaper, the extracurricular program looms large. The reader can
scarcely escape the conclusion that the school year is largely occupied with
social affairs and athletics. Those who criticize public education as an
exponent of “fads and frills” could hardly do better than introduce the
yearbook as Exhibit A. Beside the stadium, the library dwindles into
insignificance, and such things as classrooms and laboratories are deemed so
unimportant as to be omitted altogether. It is not too much to expect that
the student publications present a truer picture of the school, giving greater
prominence to those features which justify its existence. That the public
is genuinely interested in these, there can be no doubt. Certainly parents
would put evidence of pupil progress and achievement at the top of the list.


C. Official Publications 


Annual reports. The earliest record of a formal written educational 
report was made in Boston, Massachusetts, in 1738, although informal 
oral reports had been made to town meetings in New England at an earlier
period.⁹ It is clear that from the outset the primary function of such reports
has been to inform the public regarding the aims, progress, and needs of 
the schools, and to afford an intelligent basis for determining educational 
policies. The first written report, for example, gave the enrollment in each 
school and included comments by the visiting committee on the quality 
of instruction. The function of such reports was well stated in the intro- 
ductory pages of the 1841-1842 report of Fall River, Massachusetts, as 
follows: 

Those who are taxed to support Public Schools, have a right to know how their 
money is expended, and what is the character of the schools which they are re-
quired to maintain. The committee are but the agents employed by the town to 
take the agency of Common School Education, and the employer ought to be 
made acquainted with all that appertains to his interest, in respect to this agency. 
What the committee knows as to the schools, the town ought to know. 

Since the appearance of standardized tests, the annual reports often 
describe the tests used and the purposes for which they are employed, and 
give summaries of the results. Some cities make effective use of graphs to 
show that progress in the tool subjects is regular from grade to grade, as 
well as profile charts to illustrate the use of standard tests in the diagnosis 
and guidance of individual pupils. There is no way better than test results 
to show the need for curriculum changes, guidance services, ungraded 
classes, and other provisions for individual differences. There can be little 
doubt that parents are interested in receiving not only an account of how 
the money for public education was spent, but also of what it bought in 
the way of an efficient educational program. 

But most school reports have one fatal weakness: They are not read.
The reason for this has been stated as follows: “Most official reports are
dull. Their authors, though they have the most interesting material in the
world, treat it perfunctorily, statistically, as lifeless stuff to be put away
in mortuary files.”¹¹ The problem with school reports, as G. Stanley Hall
long ago pointed out in the case of moral education, is how to make virtue
exciting.¹²

9 Ward G. Reeder, An Introduction to Public-School Relations, pages 85-87. New York:
The Macmillan Company, 1937.

10 Quoted from M. G. Neale, School Reports as a Means of Securing Additional Support
for Education in American Cities, pages 4-5. Columbia, Mo.: Missouri Book Company,

11 Editorial in The New York Times, January 4, 1926.

12 For a good discussion see: Ward G. Reeder, op. cit., pages 89-104.


Special reports and publications. It must be recognized that nothing 
is great or small, good or bad, except by comparison. Because of this fact, 
school surveys, which attempt to interpret the local schools in relation to 
those of other systems of similar size, are important. At times, such studies 
made by impartial outside agencies are especially effective. It is even bet- 
ter, perhaps, to have a continuous self-survey, and to report at strategic 
intervals various phases of the school program. The larger cities employ 
for this purpose bulletins or magazines modeled after the house organs of 
industrial organizations. Graphical comparisons of standardized test scores 
with national norms may be so reported. 

A common criticism of the modern school program is that it has allowed 
the newer "fads and frills” to displace the older “fundamental subjects.” 
People long for “the good old days when people really learned something 
when they went to school.” The most effective argument with which to 
meet such criticism is a comparison of the achievement of the older schools 
and the newer, or of the traditional school program and the more liberal 
program of today. Riley¹³ made such a study in Springfield, Massachusetts,
of the results of tests in 1906 that had first been given to children in the 
city sixty years earlier, in 1846. The findings, briefly summarized below 
in terms of percentage of correct responses, were favorable to the later 
schools. 


                         Percentage Correct
                         1846      1905-1906
Arithmetic ..........    29.4        65.2
Spelling ............    40.6        51.2
Geography ...........    40.3        53.4


Fish¹⁴ made a somewhat similar study comparing the achievement of
Boston children in 1928 with that of pupils in the city on the same tests 
in 1853, seventy-five years earlier. Again the results, expressed in terms 
of errors made, favored the later schools: 


                            Errors Made
Subjects                 1853        1928
Arithmetic ..........     5.4         1.6
Grammar .............     6.5         3.1
Geography ...........     4.4         4.2


13 J. L. Riley, The Springfield Tests. Springfield, Mass.: The Holden Patent Book
Company, 1908.

14 Louis J. Fish, Examinations Seventy-Five Years Ago and Today. Yonkers: World
Book Company, 1930.


These studies suggest preserving the results of standardized tests so that
at intervals of perhaps ten or twenty years they can be compared with
current results on these tests, which will afford convincing evidence of
trends in efficiency. A study of this type, covering achievement in
Philadelphia high schools for a ten-year period, has been made by Boyer and
Gordon;¹⁵ and a study of arithmetic for a twelve-year period in St. Louis
has been made by Boss.¹⁶
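
The kind of then-and-now comparison described above amounts to subtracting the earlier figure from the later one, subject by subject. The following Python sketch is offered only as an illustration; the dictionary names and the helper `change` are hypothetical, and the figures are those reported above for the Springfield and Boston studies (percentage correct in the first case, errors made in the second).

```python
# (earlier result, later result) for each subject, as reported in the text.
SPRINGFIELD_PERCENT_CORRECT = {  # 1846 vs. 1905-1906
    "Arithmetic": (29.4, 65.2),
    "Spelling": (40.6, 51.2),
    "Geography": (40.3, 53.4),
}
BOSTON_ERRORS_MADE = {  # 1853 vs. 1928
    "Arithmetic": (5.4, 1.6),
    "Grammar": (6.5, 3.1),
    "Geography": (4.4, 4.2),
}


def change(results):
    """Later result minus earlier result for each subject, to one decimal."""
    return {
        subject: round(later - earlier, 1)
        for subject, (earlier, later) in results.items()
    }


# Positive values mean more items correct in the later testing.
print(change(SPRINGFIELD_PERCENT_CORRECT))
# Negative values mean fewer errors in the later testing.
print(change(BOSTON_ERRORS_MADE))
```

Because one study reports percentage correct and the other reports errors, the sign of the change must be read differently in each case: a gain in Springfield and a reduction in Boston both favor the later schools.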


D. Report Cards and Letters to Parents 


Trends in report cards. For many years report cards have furnished 
the most direct line of communication between the home and the school. 
They have ordinarily consisted of a record of the pupil’s attendance and 
academic achievement, expressed in teachers’ marks, sent to the parent at 
intervals of a month or six weeks. In recent years, however, certain impor- 
tant changes have taken place. In a comprehensive survey of the literature 
relating to report cards, Messenger and Watts¹⁷ noted the following trends:


1. There is general dissatisfaction with any scheme of grading that encourages 
the comparison of pupils with each other. 

2. If any grades are used, a scale with fewer points is favored, a three-point 
scale being most often recommended. 

3. There is a wide-spread feeling that the schools should evaluate traits other 
than mere subject-matter achievement. 

4. There is a clear tendency to use descriptive rather than quantitative reports. 

5. Report cards are being displaced by notes or letters to parents. 

6. Cards, notes, or letters are being sent at less frequent intervals and in some 
schools only when there is specific occasion for such communications.

7. Attempts are being made to give more detailed diagnosis of pupils’ achieve- 
ments. 

8. Parents are being asked to cooperate in building report forms. 

9. Pupils are cooperating both in devising report cards and in evaluating their 
own accomplishment. 


A study¹⁸ of trends in nine western states indicated that these changes
were more marked in the elementary than in the secondary school. This
study notes a wholesome effect on the personalities of the pupils, the effect
being especially marked for those of lower ability. The most noticeable
effect, however, appears to be in improved teacher-pupil relationships.

There is also evidence that these newer systems of reporting are often
approved by the parents. After six years' experience, for example, one
writer makes this positive statement: “The letter fosters a much more
co-operative relation between home and school.”¹⁹ Morrisett²⁰ reports a study
in which the principal of a large junior high school submitted a list of forty
items to the parents with the instruction to check “items in which you are
most interested; that is, those items about which you would like to know
more.” The item “What parents can do to promote pupil accomplishment”
ranked first. Other items high in the list clearly indicated that parents
desired more information regarding educational and vocational guidance.
The weakness of the older report card was just here. The information
supplied to parents, even if its accuracy could be assumed, was of such a
general character as to be of little help in either diagnosis or guidance, in
which full co-operation with the home is most needed.

15 Philip A. Boyer and Hans C. Gordon, “Have High Schools Neglected Academic
Achievement?” School and Society, 49: 810-812, June 24, 1939.

16 Mabel E. Boss, “Arithmetic, Then and Now,” School and Society, 51: 391-392,
March 23, 1940.

17 Helen R. Messenger and Winifred Watts, “Summaries of Selected Articles on School
Report Cards,” Educational Administration and Supervision, 21: 539-550, October, 1936.

18 Henry H. Hartley, “Report Card Trends in West,” Nation's Schools, 24: 61-53,
November, 1939.

Evans²¹ has traced the evolution of the report card. He notes a definite
trend away from the standardized printed card and toward a more flexible, 
informal report that is better adapted to local conditions and needs. There 
is an increasingly clear recognition that the function of reporting is inter- 
pretation rather than presentation, with the emphasis on progress rather 
than on status. 

Hill’s study of report cards. Hill²² analyzed 443 report cards from
towns and cities of all sizes, representing all educational levels and practi- 
cally every state. He concluded that a satisfactory report card should: 


1. Represent the true spirit, purposes, and functions of the school, . . . 

2. Reflect educational objectives arrived at only after careful consideration and
mature judgment. 

3. Change in accord with changes in educational standards and educational 
philosophy. . . . 

4. Present a report of achievement that is broad enough to cover all the im- 
portant educational outcomes—subject achievement, character outcomes and social 
adjustment, health, and use of leisure. 

5. Give an adequate picture of causes as well as of outcomes, . . . 

6. Reflect a complete and sympathetic understanding of the child. 

7. Afford a means of reporting flexible enough to account for the peculiar 
individual abilities of each child. 

8. Give an account of pupil progress understandable and instructing to both 
pupil and parent. 

9. Bring about closer co-operation and greater mutual understanding of home
and school.

10. Provide for reciprocal reporting. [That is, space for suggestions and questions
from the parent.]

11. Rate achievement in relation to the basic abilities and capacities of the child.

12. Rate achievement by means of valid and reliable marking systems.

13. Conform to reasonable standards of form and appearance. The report should
be attractive.

19 V. L. Beggs, “Reporting Pupil Progress without Report Cards,” Elementary School
Journal, 37: 107-114, October, 1936.

20 L. N. Morrisett, “Interpreting the School to the Public,” Clearing House, 7: 480-
485, April, 1933.

21 Robert O. Evans, Practices, Trends, and Issues in Reporting to Parents on the Welfare
of the Child in School, 98 pages. New York: Bureau of Publications, Teachers College,
Columbia University, 1938.

22 George E. Hill, “The Report Card in Present Practices,” Educational Method, 15:
115-131, December, 1935.


The ordinary report card often fails to meet the fourth requirement in
the above list. It tends to neglect the less tangible but important outcomes
of education reflected in social and personal qualities. One advantage of the
informal report card or letter to parents is that it attempts to inform
parents on all phases of pupil growth. But it is the spirit of the report
rather than its form which is important. Indeed a curt note from the teacher
may be worse than the usual report card. Elsbree²³ cites the following letter
from a teacher to the parents of a slow learner which is a good illustration 
of “How to Lose Friends and Influence Parents—in the Wrong Direction": 


Dear Parents, 


Donald has improved in nothing except 
spelling and that very little, 


Sincerely, 


Teacher 


For use in the elementary school, Hill suggests the informal report to 
parents, reproduced with slight modifications in Figure 54. A similar form 
for the second half of the semester calls attention to improvements noted, 
and invites further parental co-operation on other points. Neither the re- 
port itself nor the letter accompanying it makes such demands upon the 
teacher's time as does the personal letter, which should probably be re- 
served for very special occasions. It is always a good idea, of course, to 
apply the grease when the squeak appears. The letter suggested to
accompany the first report is as follows:


Dear (name of parent or guardian): 

Now that the semester is one-half over we wish to call your attention to
............'s school progress. The enclosed report covers four kinds of
progress—progress in school subjects, health and physical condition, attendance, 
and school citizenship. If you would like to talk over the report, or to get more 
complete information on your boy's success in school, we should be glad to have 
you come to see us. If you can telephone us or send a note ahead of time, it will 
make it easier to arrange a meeting. 

The upper part of the report is for you to keep for future reference. Please return 
only the lower part. We are especially anxious to get any information from you that 
will aid us in helping your boy make a complete success of his school work. Any 
information or suggestions you may wish to write will be welcome. 


Sincerely yours,
(Signed by teacher and
principal)

23 Willard S. Elsbree, Pupil Progress in the Elementary School, page 76. New York:
Bureau of Publications, Teachers College, Columbia University, 1943.




REPORT FOR FIRST HALF OF THE FIRST SEMESTER

PROGRESS IN SCHOOL SUBJECTS. .................... is doing very good
work in ....................

His work is good in ....................

His work is poor and needs improvement in ....................

His work in these subjects would probably be improved if ....................

ATTENDANCE. Half days absent ....... Number of times tardy ......

REMARKS. ....................

SCHOOL CITIZENSHIP. We believe that every boy should be happy in
school, should take part in the life of the school, should get along well
with his classmates, and should develop good habits of honesty, courtesy,
neatness, consideration for the rights of others, and industry.

Your boy is especially strong in ....................

He could improve in ....................

(Parent or guardian)

REMARKS OR SUGGESTIONS. ....................

Figure 54. A Suggested Informal Report to Parents (After Hill)




Suggestions for letters to parents. The art of writing effective letters 
to parents will require special training and practice. To assist teachers in 
acquiring this necessary skill, the schools of Santa Monica, California, 
prepared a very helpful list of suggestions.24 The list in somewhat abridged
form is as follows: 


1. Begin the letter with encouraging news.
2. Close with an attitude of optimism.
3. Solicit the parents’ co-operation in solving the problems, if any exist.
4. Speak of the child's growth: social, physical, and academic.
   a. Social (citizenship traits)
      (1) Desirable traits: attention, care of property, co-operation, honesty,
          effort, fair play, etc.
      (2) Undesirable traits: selfishness, wastefulness, untruthfulness,
          dishonesty, carelessness, etc.
   b. Physical (health conditions)
      Posture, weight, vitality, etc.
   c. Academic
      (1) Interest in school and extra-school activities.
      (2) Methods of work.
      (3) Achievements: (a) growth in knowledge, appreciation, techniques;
          (b) list subjects in which child is making progress and those in which
          he is not making progress; (c) relationship of his accepted standards to
          his capacities.
5. Compare the child's efforts with his own previous efforts and not with those
   of others.
6. Speak of his achievements in terms of his ability to do school work.
7. Please remember that every letter is a professional diagnosis, and therefore
   is as sacred as any diagnosis ever made by any physician.


А more elaborate 21-page manual to guide teachers in the preparation of 
reports to parents was prepared by the Omaha, Nebraska, school system. 

The Colorado experiment. Although it is true that the aim of all
evaluation and reporting to parents is the complete development of the 
child, it is often necessary to “temporize ideals with practical considera- 
tions." 

The experience of the Secondary School of Colorado State College of 
Education is especially instructive.25 Detailed analytical evaluation sheets
were tried and abandoned primarily because of the excessive amount of 
time required to prepare them. The use of the terms unsatisfactory, satis- 
factory, and honors was given up because it was felt that any attempt to 
evaluate pupils both in terms of their own ability and the objectives of the 
curriculum is sure to involve negative reactions. Evaluations of the ordi- 
nary scale type were tried and abandoned because they afford only a par- 


24 Ibid., pages 83-84.
25 William L. Wrinkle, “The Story of a Secondary-School Experiment in Marking
and Reporting,” Educational Administration and Supervision, 23: 481-500, October,
1937.




tial report. Anecdotal records were attempted and discontinued because 
the teachers tended to select unusual activities and experiences instead of 
reporting an ordinary picture of the pupil’s growth and progress. Confer- 
ence meetings of counselor, teacher, and parents, although successful for 
a time, had to be given up because of the failure of the majority of parents 
to respond to the school’s invitation to avail themselves of these conference 
opportunities. The school eventually prepared lists of “statements of trait 
actions” which were indicative of the pupil’s attainment of such general 
school objectives as self-direction, social adjustment, breadth of interests, 
personal attractiveness, care of materials and equipment, basic reading 
skills, and the like. These were then evaluated on a five-point scale, H, S,
N, U, O, indicating distinctly superior, satisfactory, needs to make improvement,
unsatisfactory, and no evaluation, respectively.

The experiment continued for many years. Wrinkle26 summarized the
program as follows: 


In the thirteen years which have elapsed, new forms and new practices have 
been developed, tried, scrapped, and replaced by newer forms and practices.
Detailed analytical reports, scale-type evaluations, the conference plan, anecdotal 
reports, and check-list type reports were developed and discarded because they did 
not do a good job of conveying information or demanded too much time. 

Repeatedly it was discovered that adequacy meant detail and detail meant 
forms which were impractical for use in public school situations. One criterion
which resulted in the scrapping of many forms and practices including those which 
were successful in their use in the laboratory school was: Whatever is developed must 
be usable in the public schools by public school teachers. 


In May, 1945, a popular referendum was held in which all high-school 
students participated; the general consensus was highly favorable but sev- 
eral changes were proposed. For example, 99 per cent of the students 
thought they should always be allowed to see their scores on standardized 
achievement tests; also 90 per cent of the students thought that the reports 
to parents should show how the actual achievement compared to the ex- 
pected achievement. 

The University of Chicago High School System of reporting. The 
University of Chicago High School plan illustrates a dual system of re- 
porting. At the end of each semester the parents receive a detailed report 
in terms of the specific objectives of each course and whatever comments 
are deemed necessary. A week or so after the detailed reports are sent out 
and the parents have had an opportunity to study the strengths and weak- 
nesses of the pupil, the course marks are forwarded and are usually accepted 
by the parent as incidental supplementary information. Figure 55 illus- 
trates one of the detailed semester reports in social studies. The Chicago 


26 William L. Wrinkle, “Reporting Pupil Progress,” Educational Leadership, 2: 293-295,
April, 1945.


THE UNIVERSITY OF CHICAGO
The Laboratory School
SEMESTER REPORT, SOCIAL STUDIES III

Last Name ......................   First Name ......................

Purposes                                          Rating    Comments (if any)

 1. Acquisition of basic information
 2. Reading skills
    a. recognizing main ideas
    b. recognizing pertinent data
    c. social studies vocabulary
 3. Oral skills
    a. presentation of ideas
    b. organization of ideas
    c. adequacy of content
 4. Writing skills
    a. organization of ideas
    b. adequacy of content
 5. Ability to interpret social data
 6. Ability to apply principles in new situations
 7. Interest in current affairs
 8. Courtesy and cooperation in group situations

Habits of Work
 9. Persistence in overcoming difficulties
10. Tendency to work independently
11. Promptness in completing work
12. Application during study
13. Attention to class activities
14. Participation in class activities
15. Effectiveness in following directions

Pupil's Grade ......................   Instructor ......................

Figure 55. A Report Card Used at the University of Chicago High School.




system may be regarded as a desirable transition between the formal report
cards and the informal letter to parents. 


E. Other Avenues of Public Information 


School exhibits. There is no sounder principle of evaluation than that 
contained in the statement, “By their fruits ye shall know them." Exhibits 
afford one of the best ways of presenting the “fruits” of the school. The 
public is evidently interested in local, county, state, national, and inter- 
national fairs and exhibitions of all types, and schools could make use of 
this fact. Posters and displays of pupils’ work, as well as public programs 
of a dramatic, literary, or musical character, afford concrete demonstra- 
tions of the school’s educational program. Commencement programs in 
which the pupils themselves play the leading roles afford an excellent opportunity
for the public to see the end products of the school. In the final
analysis, however, the ordinary everyday behavior of the pupil is the best 
evidence of the worth of the school. What the pupil thinks and what the 
pupil says are both important; but what the pupil is speaks a still more
eloquent language. 

School visitation. Vicarious knowledge is important, but it is usually 
a poor substitute for first-hand experience. Whenever possible, therefore, 
the public should have an opportunity to see their school in actual opera- 
tion. The school should cultivate a reputation for friendliness. The an- 
nounced policy of the school should be, “The latch string is always out.” 
It is a rare parent indeed who would not rather see his own child “perform” 
than witness world-famous actors on television. Furthermore, to observe 
the process of upholstering a chair or fashioning a dress is inherently more 
interesting than merely to look at the finished product. 

The parent-teacher association. The modern educator recognizes 
more clearly than did his predecessor that education is a continuous unified 
process, that several agencies contribute to its accomplishment, and that 
of these the home and the school are most important. It is self-evident, 
therefore, that there should be intelligent and wholehearted co-operation 
between the home and the school. The local parent-teacher association 
seeks through mutual understanding to effect this needed co-operation. 
At its best, the association is a modern successor to earlier visits of teachers
to the pupils’ homes and of the parents to the school, both of which are 
increasingly difficult with the growth of the school population and with the 
enlargement of the area served by the individual school. 

From the viewpoint of the home, the association affords an opportunity 
for parents not only to hear about the school’s program and philosophy and 
to see the school in actual operation, but also to react to what they hear 
and see. The modern parent, like the modern child, wants to be heard as 
well as seen. Certainly at all times he is entitled to communicate freely in 
an accepting atmosphere. A free interchange of feelings and ideas may be




facilitated by the use of group techniques such as role playing, sociodrama, 
and leaderless group discussion.27


F. Mobilizing Public Opinion 


Sampling the opinion of parents. To what extent can the judgment 
of parents be utilized in the evaluation and improvement of the school? 
Eells28 attempted to use the opinions of the parents of seniors in evaluating
the secondary schools attended by their sons or daughters. He employed 
a five-point scale ranging from “extremely satisfactory” at one end to 
“extremely unsatisfactory” at the other. Twelve items were included, re- 
lating to the general quality of instruction, development of good character, 
training in good citizenship, guidance activities, and the like. The principal 
of the school personally signed and mailed to the parents of seniors in his 
school a double postal card containing the following message: 


To the Parents of Seniors: 

Our school has been selected as one of two hundred high schools and other 
secondary schools in the United States to be critically studied and evaluated in an 
effort to improve the standards of secondary education throughout the country. 
The study is not connected in any way with the Federal Government. 

One part of the plan for this national study calls for a frank evaluation of the 
school from the standpoint of the parents. We are asking parents of our seniors 
to state their honest opinions concerning certain aspects of our school, as judged 
by the development of their children during their school life here. You are urged 
to express your candid judgment, whether it is favorable or unfavorable. You are 
not asked either to praise or to defend the school, only to judge it. The card need 
not be signed and it is to be sent directly to the headquarters of the study in 
Washington. I shall not see it again. 

I am eager to have a hundred per cent response from the parents of pupils in 
this school. Won’t you fill the card out and mail it promptly? Within a day or 
two, please! 


The study concluded that “the parents, on the whole, showed a marked 
degree of discrimination,” as judged by the scattering of the ratings along 
the scale. Only a quarter of the ratings were “exceedingly satisfactory,” 
and more than 7 per cent were “not very satisfactory” or “exceedingly 
unsatisfactory.” It is significant that the guidance program of the typical 
school was considered least satisfactory, a judgment supported by other 
criteria. Yet perhaps no phase of the school program is more dependent 
for its success upon parental co-operation than is guidance. “Regardless of 
whether parents are correct in their judgments, it is important to know 
what these judgments are,” Eells points out, “for in the last analysis the

27 Helpful sources of information are Jean E. Grambs, “Dynamics of Psychodrama 
in the Teaching Situation,” Sociatry, 1: 383-399, March, 1948; Herbert A. Thelen, 
"Human Dynamics in the Classroom,” Journal of Social Issues, 6: 30-55, No. 2, 1950; 
and William Clark Trow and others, “Psychology of Group Behavior,” Journal of Edu- 
cational Psychology, 41: 322-338, October, 1950. 


28 Walter Crosby Eells, “Judgments of Parents Concerning American Secondary
Schools,” School and Society, 46: 409-416, September 25, 1937.




parents support and control the schools.” Another writer29 emphasizes the
point that although it is important to discover what the public knows about
its schools it is even more important to learn what the public feels about its 
schools. 

Concluding statement. It is one of the fundamental beliefs of a de- 
mocracy that reliance can be placed on an enlightened public opinion. It 
is to achieve this end that public schools are maintained. But it is erro- 
neous to assume that the responsibility ceases when the formal period of 
instruction ends. In a changing world the continued enlightenment of the 
adult population is increasingly recognized as a major responsibility of a 
democratic society. No individual or group can be expected to think or to 
act intelligently on anything without the necessary information. To supply 
this information about the schools is the objective of the public relations 
program. At all times the school will do well to keep in mind the words 
of one of America’s ablest statesmen, Abraham Lincoln: 


Public sentiment is everything. With public sentiment nothing can fail, without
it nothing can succeed. Consequently he who molds public sentiment goes
deeper than he who enacts statutes or pronounces decisions. 


SELECTED REFERENCES FOR FURTHER READING


Elsbree, Willard S., Pupil Progress in the Elementary School. New York: Bureau
of Publications, Teachers College, Columbia University, 1943. Chapter VIII.

Evans, Robert O., Practices, Trends, and Issues in Reporting to Parents on the
Welfare of the Child in School. New York: Bureau of Publications, Teachers
College, Columbia University, 1938. 98 pages.

Froehlich, Clifford P., and Darley, John G., Studying Students; Guidance Methods 
for Individual Analysis. Chicago: Science Research Associates, 1952. 411 pages. 

Rothney, John W. M., and Roens, Bert A., Guidance of American Youth; an 
Experimental Study. Cambridge, Massachusetts: Harvard University Press, 1950. 
269 pages. 

Scott, William O. N., Desirable Objectives for Public Schools—An Opinion Analysis. 
Unpublished Ph.D. Dissertation, George Peabody College for Teachers, Nashville,
Tennessee, 1951. 236 pages.

Smith, Eugene R., Tyler, Ralph W., and Staff, Appraising and Recording Student 
Progress. New York: Harper & Brothers, 1942. Chapters IX-XI.

Sykes, Gresham M., “The PTA and Parent-Teacher Conflict,” Harvard Educational
Review, 23:86-92, Spring, 1953. 

Thorndike, Robert L., Personnel Selection; Test and Measurement Techniques. New
York: John Wiley & Sons, 1949. Chapter 11, “The Personnel Selection Program
and the Public.”

Traxler, Arthur E., Techniques of Guidance. New York: Harper & Brothers, 1945.
Chapter XIII.

Wrinkle, William L., and Gilchrist, Robert S., Secondary Education for American
Democracy. New York: Farrar and Rinehart, 1942. Chapter 39.

29 Warren C. Seyfert, “What the Public Thinks of Its Schools,” School Review, 48:
417-427, June, 1940.


17 


Some Present Trends 


In the preceding sixteen chapters and in the six appendixes on pages 
429-465 a multitude of measurement problems receive attention. It seems 
desirable, nevertheless, in this final chapter to present a brief overview of 
current trends. 

Reliability. The extreme emphasis upon high reliability coefficients 
which characterized educational and psychological measurement during 
the 1920's and 1930’s has died down, though when a decision concerning 
an individual’s future status in a certain trait is being based upon a single 
test, considerable stability is needed. For predicting a criterion, several 
well-constructed but short and hence only moderately reliable tests are 
usually better than one relatively more reliable instrument. The short tests 
should correlate with each other as near zero as possible, but each should 
correlate well with the criterion to be predicted. 
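
The arithmetic behind this point can be sketched for the two-test case, using the standard formula for the multiple correlation of a criterion with two predictors. The correlation values below are hypothetical and were chosen only for illustration; they do not come from any study cited in this chapter.

```python
import math

def multiple_r(r1c, r2c, r12):
    """Multiple correlation of a criterion with two predictor tests.

    r1c, r2c: correlation of each test with the criterion.
    r12: correlation between the two tests themselves.
    """
    r_squared = (r1c**2 + r2c**2 - 2 * r1c * r2c * r12) / (1 - r12**2)
    return math.sqrt(r_squared)

# Hypothetical values: two short tests, each correlating .40 with the criterion.
uncorrelated = multiple_r(0.40, 0.40, 0.00)  # the tests measure different things
overlapping = multiple_r(0.40, 0.40, 0.60)   # the tests largely duplicate each other

print(round(uncorrelated, 3), round(overlapping, 3))
```

With these assumed figures the uncorrelated pair predicts the criterion noticeably better than the overlapping pair, although each test taken alone is equally valid.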

"Spuriously" high single-form reliability coefficients may be obtained by 
two different methods: administering the test to an extremely heterogeneous
group or applying split-half or Kuder-Richardson computational
procedures to a highly speeded test. If the examinees upon whom any
reliability coefficient is based have more variable scores (a higher standard
deviation) than your testees, then the reliability coefficient secured for
your group will in all likelihood be lower than theirs.
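
A rough estimate of how much lower can be sketched with a commonly used formula, which assumes that the standard error of measurement is the same in both groups. That assumption, the function name, and the figures below are illustrative and are not taken from the text.

```python
def reliability_in_new_group(r_published, sd_published, sd_yours):
    """Estimate reliability in a group with a different score spread.

    Assumes the error variance (squared standard error of measurement)
    is the same in the published group and in your own group.
    """
    error_variance = sd_published**2 * (1 - r_published)
    return 1 - error_variance / sd_yours**2

# Hypothetical case: the manual reports r = .90 for a group with SD = 15,
# but your own, more homogeneous, class has SD = 10.
estimate = reliability_in_new_group(0.90, 15, 10)
print(round(estimate, 3))
```

Under these assumptions a published coefficient of .90 shrinks to roughly .78 in the more homogeneous class.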

Validity. The American Psychological Association’s Committee on Test 
Standards lists four types of validity:1

1 Lee J. Cronbach (Chairman), “Technical Recommendations for Psychological Tests
and Diagnostic Techniques: Preliminary Proposal,” American Psychologist, 7: 461-475,
August, 1952. Pages 467-468. Quoted with permission of the American Psychologist and
the American Psychological Association.






1. Predictive validity denotes correlation between the test and subsequent cri- 
terion measures. 

2. Concurrent or status validity denotes correlation between the test and con- 
current external criteria. 

3. Content validity refers to the case in which the specific type of behavior called 
for in the test is the goal of training or some similar activity. . . . An academic
achievement test is most often examined for content validity. 

4. Congruent [or construct] validity is established when the investigator demon- 
strates what psychological attribute a test measures by showing correspondence 
between scores on the test and other indicators of the state or attribute. 


The Committee states that “the [test] manual should make clear what 
type of inference the validation study supports. No manual should report 
that ‘this test is valid.’ In the past, evidence that is not appropriately 
termed evidence of validity has been presented in the manual under that 
heading.”2 The test user should ask himself and the test salesman, “Valid
for what?” For instance, does the test predict success in the first year of
college reasonably well? Does it correlate substantially with current level
of aspiration? Is it based upon a careful sampling of the content and operations
in a given set of textbooks, course units, or syllabi?3 How highly
does it correlate with similarly named tests?4 Of course, few tests are valid 
in all four of the above senses, but the user will want to be sure that the 
test has the kind and degree of validity he needs. 

The criterion problem. In recent years the thing-to-be-predicted has 
been shown to be of crucial importance, since even the best possible test 
cannot predict an extremely faulty criterion well. The criterion may not be 
reliable enough; it may not be completely relevant; and it may be immediate
or intermediate, when some more ultimate behavior needs to be predicted.5
As an example, take an attitude inventory which attempts to get
at “good citizenship.” No matter how carefully constructed this instrument
is, scores obtained on it by the pupils in a given class will not correlate well
with citizenship ratings assigned to them if those ratings are themselves
unreliable. Furthermore, the school is
probably quite interested in the adult citizenship behavior of the former 
student, so unless the immediate criterion, the teacher’s ratings, is highly
correlated with the ultimate criterion, the status validity of the inventory
may differ considerably from its predictive validity.

Ratings may usually be made more reliable by having several well in- 


2 Ibid., page 468.

3 See Phillip J. Rulon, “On the Validity of Educational Tests,” Harvard Educational
Review, 16: 290-296, October, 1946; also available free as Test Service Notebook No. 3,
World Book Company.

4 A study in which tests with quite different rationales but verbally similar categories
intercorrelated as expected is Julian C. Stanley and Robert S. Waldrop, “Intercorrelations
of Study of Values and Kuder Preference Record Scores,” Educational and Psychological
Measurement, 12: 707-719, Winter, 1952.

5 For a comprehensive discussion of the criterion which is particularly applicable to
education see Edward E. Cureton, “Validity,” Chapter 16 in E. F. Lindquist (Editor),
Educational Measurement. Washington, D. C.: American Council on Education, 1951.




formed persons rate each individual and then take the mean of their ratings. 
If the raters have not all had considerable opportunity to observe the ratees 
with respect to the characteristic being rated, however, this process may 
result in some loss of relevance. 
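
The gain in reliability from pooling raters can be sketched with the Spearman-Brown formula, which assumes that the raters are interchangeable, that is, equally reliable and equally well acquainted with the ratees. The single-rater figure below is hypothetical.

```python
def pooled_rating_reliability(single_rater_r, n_raters):
    """Spearman-Brown estimate of the reliability of the mean of n ratings.

    Assumes every rater is equally reliable and rates the same trait.
    """
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

# Hypothetical: one teacher's ratings have a reliability of .50.
for n in (1, 2, 4, 8):
    print(n, round(pooled_rating_reliability(0.50, n), 3))

four_raters = pooled_rating_reliability(0.50, 4)
```

Under this assumption the mean of four ratings reaches a reliability of .80. The formula says nothing about relevance, so it does not protect against the loss described above when some raters know the ratees poorly.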

Nearly all of the criteria used in predictive validity studies are inter- 
mediate: success or failure in medical school, rather than competence as a 
practicing physician; passing or failing in flying school, instead of achieve- 
ment in combat; grades in the teacher-training curriculum, not perform- 
ance on the job ten years after graduation; and score on the final training- 
school exam in lieu of competence as an automobile mechanic. Indeed, 
most “ultimate” criterial measures are hard to get, all too unreliable, and 
of doubtful relevance. This is illustrated rather dramatically by the numerous
attempts to determine what a “competent teacher” is.6

Factor analysis. Since the early 1930’s an increasingly large number of 
measurement specialists have worked both theoretically and practically 
with factor analysis. The well-known Primary Mental Ability tests (PMA)7
had their origins in factor analyses performed by Louis L. Thurstone, 
whereby he used mathematical methods to identify a few “factors” (Verbal, 
Word Fluency, Number, Space, Memory, Reasoning, and Perceptual 
Speed) which accounted for most of the positive correlations among a large 
number of mental tests. 
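
How a single factor can be drawn from a table of correlations may be sketched with the first step of the centroid method, a hand-computation procedure of that period. The text does not say which computations were used for the tests named here, so this is an illustration only; the correlation matrix is hypothetical, and unities are placed in the diagonal for simplicity where a working analyst would ordinarily insert communality estimates.

```python
import math

def first_centroid_loadings(corr):
    """First centroid factor loadings from a square correlation matrix.

    Each loading is a column sum divided by the square root of the
    grand total of all entries (diagonal included).
    """
    total = sum(sum(row) for row in corr)
    col_sums = [sum(row[j] for row in corr) for j in range(len(corr))]
    return [s / math.sqrt(total) for s in col_sums]

# Hypothetical correlations among three mental tests.
corr = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
loadings = first_centroid_loadings(corr)
print([round(a, 3) for a in loadings])
```

Each loading estimates how strongly a test shares in what the tests have in common. Later factors would be extracted from the residual correlations left after the first factor is removed, a step not shown here.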

The Holzinger-Crowder Uni-Factor Tests for Grades VII through XII 
are based upon factorial studies and contain verbal, spatial, numerical, 
and reasoning subtests. They first appeared in 1952.8 

Factor analysis has also been used frequently to provide information 
concerning what a test battery such as the Differential Aptitude Tests,9
the Wechsler Intelligence Scale for Children (WISC),10 or the Revised
Stanford-Binet11 is measuring.

Achievement vs. intelligence vs. aptitude tests. The old familiar
classification of ability tests into three types, achievement, intelligence, and
aptitude, has been challenged severely by correlational studies. This is
especially true of the eight Differential Aptitude Tests,12 which include all
three kinds: Verbal Reasoning, Numerical Ability, Abstract Reasoning, 

6 Simeon J. Domas and David V. Tiedeman, “Teacher Competence: An Annotated
Bibliography,” Journal of Experimental Education, 19: 101-218, December, 1950.

7 Devised by Louis L. and Thelma Gwinn Thurstone, and published by Science Research
Associates.

8 Devised by Karl J. Holzinger and Norman A. Crowder, and published by the World
Book Company.

9 Jerome E. Doppelt, The Organization of Mental Abilities in the Age Range 13 to 17.
Contributions to Education, No. 962. New York: Bureau of Publications, Teachers
College, Columbia University, 1950. 86 pages.

10 Elizabeth P. Hagen, “A Factor Analysis of the Wechsler Intelligence Scale for
Children,” American Psychologist, 6: 297, July, 1951. Abstract.

11 Lyle V. Jones, “A Factor Analysis of the Stanford-Binet at Four Age Levels,”
Psychometrika, 14: 299-331, December, 1949.

12 Abbreviated DAT, designed for Grades 8-12, and published by the Psychological
Corporation.




Space Relations, Mechanical Reasoning, Clerical Speed and Accuracy, and 
Language Usage (Spelling and Sentences). Such a battery, with its tests 
standardized on the same individuals, has distinct advantages over single 
aptitude tests assembled by the tester into an ad hoc battery. All percentile 
ranks can be compared directly without complicated subjective attempts 
to allow for quite different norm groups. 

Bennett, Seashore, and Wesman conclude from the high correlations of 
the DAT Verbal Reasoning and Numerical Ability tests with intelligence 
tests that “apparently [they] can serve most purposes for which a general 
mental ability test is usually given in addition to providing differential 
clues useful to the counselor. Hence, the use of the so-called intelligence 
test is apparently unnecessary where the Differential Aptitude Tests have 
already been used.”13

Differential prediction. A persistent guidance problem, still largely 
unsolved, is to estimate differential success in a variety of fields. Will John 
“make” a better engineer than lawyer? According to his high school grades 
and intelligence test scores he would probably pass either curriculum in 
college. Can the counselor organize all the information concerning John 
in a way which will enable the counselor to predict with a fair degree of 
confidence that success in one college field is more probable than in another? 
As currently attempted, the solution is attained largely by rule-of-thumb,
“common-sense” procedures which rely heavily upon intuition and “armchair
validity.” If John has high mechanical and scientific interest scores
on the Kuder Preference Record and high Numerical Ability, Space Rela- 
tions, and Mechanical Reasoning scores on the DAT, while he is somewhat 
lower on the Kuder persuasive category and the DAT Verbal Reasoning, 
Abstract Reasoning, and Language Usage tests, very likely he will be 
counseled toward engineering instead of law. 

This is an unsatisfactory method, however, since it relies too heavily 
upon assumed validity and subjective weighting of the various test scores 
to arrive at a “felt” probability of success in one field versus the other. 
For some time statisticians have been evolving methods of profile and dis- 
criminatory analysis to make differential prediction objective. Though this 
literature has barely touched educational measurement yet, it does seem 
to have great importance for counselors. Perhaps the most easily understood
articles for the interested student to read are Tiedeman’s and Rulon’s.14
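
One objective procedure of this kind, the two-group linear discriminant function, can be sketched as follows. The sketch assumes only two test scores per person and two fields; the score pairs, the field labels, and the function names are hypothetical and are not drawn from the articles cited.

```python
def mean_vector(rows):
    """Mean of each of the two scores across a group."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_covariance(group_a, group_b):
    """Pooled within-group 2x2 covariance matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def discriminant(group_a, group_b):
    """Return (weights, cutoff) for the two-group linear discriminant."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    s = pooled_covariance(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # The cutoff is the discriminant score of the point midway between group means.
    cutoff = w[0] * (ma[0] + mb[0]) / 2 + w[1] * (ma[1] + mb[1]) / 2
    return w, cutoff

def classify(scores, w, cutoff):
    """Assign a person to the field whose members he resembles more."""
    z = w[0] * scores[0] + w[1] * scores[1]
    return "engineering" if z > cutoff else "law"

# Hypothetical (numerical, verbal) scores of successful students in each field.
engineers = [(70, 52), (75, 48), (68, 55), (80, 50), (72, 45)]
lawyers = [(50, 70), (55, 76), (48, 68), (60, 72), (52, 74)]

w, cutoff = discriminant(engineers, lawyers)
print(classify((78, 55), w, cutoff))
```

The weights replace the counselor's subjective weighting of the two scores. A practical application would need far larger groups, more measures, and a check on fresh cases before the resulting rule could be trusted.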
13 George K. Bennett, Harold G. Seashore, and Alexander G. Wesman, A Manual for
the Differential Aptitude Tests (Second Edition), page [illegible]. New York:
Psychological Corporation, 1952.

14 David V. Tiedeman, “The Utility of the Discriminant Function in Psychological
and Guidance Investigations,” Harvard Educational Review, 21: 71-80, Spring, 1951,
and David V. Tiedeman and Jack J. Sternberg, “Information Appropriate for Curriculum
Guidance,” Harvard Educational Review, 22: 257-274, Fall, 1952. A shrewdly
humorous approach is Phillip J. Rulon, “The Stanine and the Separile: A Fable,”
Personnel Psychology, 4: 99-114, Spring, 1951.




The “whole” child. Emphasis has shifted somewhat from strictly 
objective measurement of specific traits to co-operative evaluation and ap- 
praisal of the “whole” child. Such abstract characteristics as “respecting 
the rights of others,” “participating democratically in group activities,” 
and “developing habits of good citizenship” occupy the attention of 
teachers bent upon comprehensive evaluation. Grading each child in rela- 
tion to the class norm is minimized in the “modern” school, where pupils 
compete with their own past records. 

To a considerable extent this “wholistic” approach is congruent with
developments in educational philosophy and psychology since 1925, though
at times it has resulted in a flight to complete subjectivism, with consequent
abandonment and ridicule of objective tests. Some teachers even take the
setting up of objectives to be synonymous with evaluating these objectives.
Cureton15 and Rulon16 have called attention to the extremely loose thinking
involved in much current “evaluation.” A quotation from the former
sets forth this point of view:17

Among the abstractions which we must at present consider intrinsically invalid 
we find most of the action series that go to make up “worthy home membership,” 
“good citizenship,” “democratic attitude,” “loyalty,” and many of the other 
ultimate aims of education. On the other hand “command of fundamental processes”
does lead to essential agreements. We can fairly well specify the acts performed
in appropriate situations by persons to be designated as having such
“command,” the acts performed in similar situations by persons who are to be
labeled as lacking such "command," and the materials upon which the acts are 
to be performed, and the bases upon which the acts are to be classified and scored 
as successful or unsuccessful. Those educators who insist (and rightly, we believe) 
that other aims are at least equally important, and in aggregate probably much 
more important, would advance their cause most rapidly and effectively by setting 
about the task of specifying the materials, actions, situations, and scoring criteria 
implied by the abstract terms which denote these other aims. They will find the 
task difficult but in most cases possible. When they have accomplished it, they will 
find that teachers will use the materials, set up appropriate school situations, and 
teach the desired acts. In those few cases where the task turns out to be impossible, the 
abstract aim must be admitted to be intrinsically invalid. 

Qualitative and semi-quantitative evaluation techniques. In re- 
sponse to the recent emphasis upon evaluating non-intellectual aspects of 
the child’s behavior there have arisen several new procedures. Anecdotal 
records, samples of revealing behavior recorded shortly after their occur- 
rence by the observer and made a part of the child’s cumulative record, 
are widely used by teachers. Various types of ratings—of one’s self, by 
peers, by teachers, and by parents—have become popular.¹⁸ Some highly

15 Edward E. Cureton, op. cit.

16 Phillip J. Rulon, “On the Concepts of Growth and Ability," Harvard Educational 
Review, 17: 1-9, Winter, 1947. 

17 Edward E. Cureton, op. cit., page 652. Italics added.

18 Sometimes self-ratings are compared with test scores, as in the study by Julian C.
Stanley, “Insight into One's Own Values,” Journal of Educational Psychology, 42: 399-408,
November, 1951.


SOME PRESENT TRENDS 421 


refined standardized rating scales have appeared, but the majority have 
been prepared locally. Check lists are used frequently, too.¹⁹

Somewhere between the tests of ability and sheer rating scales are the 
numerous inventories which employ forced-choice methods. The Allport- 
Vernon-Lindzey Study of Values, illustrated on pages 176 and 190, consists 
of questions to which there are no objectively “right” or “wrong” answers. 
The Kuder Preference Record has triad items, each containing three ac- 
tivities from which the examinee is to pick the one he likes most and the 
one he likes least; obviously, this is a 1-2-3 ranking arrangement. Neither
of these inventories was prepared in a purely subjective manner, for both 
had to meet certain statistical criteria. 

Various sociometric techniques have been devised for disclosing rela- 
tionships among members of a group or between groups. Frequently these 
are of the “Which five children in the class would you rather sit next to?” 
type. By this means teachers add to their information concerning social 
isolates and popular individuals, thereby enabling them to individualize
instruction more effectively. Other sociological innovations which may at
times be of value to educators are role playing, psychodrama, and sociodrama.²⁰
Leaderless group discussions for the identification of leaders
within small groups have been used in industry, but apparently not as yet
in public schools.²¹

Projective techniques, which until recently were used chiefly with neurotic
or psychotic adults, have in some instances been adapted to the study
of relatively normal children. A projective instrument allows the subject 
to “project” his anxieties, fears, hopes, and frustrations in a partially un- 
structured situation where he is unaware that he is thus revealing these 
inner feelings. The stimulus for this projection may be a particular set of 
inkblots, as in the Rorschach test; a set of specially prepared ambiguous 
pictures of human beings, as in the Thematic Apperception Test (TAT); 
selected pictures of animals, as in the Children’s Apperception Test (CAT); 
incomplete sentences; Make a Picture-Story Test; Draw-a-Person Test; 
doll play; and many other devices. For a discussion of the “Development 
and Applications of Projective Tests of Personality,” including a 94-item 


19 To know more about ratings, questionnaires, and check
lists, see A. S. Barr, Robert A. Davis, and Palmer O. Johnson, Educational
Research and Appraisal. Chicago: J. B. Lippincott Company, 1953. 362 pages.

20 See George Sharp, Curriculum Development as Re-education of the Teacher, 132 pages. 
New York: Bureau of Publications, Teachers College, Columbia University, 1951. 
Arthur Singer reports in a follow-up study on ‘Certain Aspects of Personality and 
Their Relation to Certain Group Modes, and Constancy of Friendship Choices," Jour- 
nal of Educational Research, 45: 33-42, September, 1951. Also pertinent is Herbert A. 
Thelen, “Human Dynamics in the Classroom,” Journal of Social Issues, 6: 30-55, No. 2, 
1950.

21 ...h, “Experimental Studies of Small Groups,” Psychological ...;
Bernard M. Bass, “An Analysis of the Leaderless
Group Discussion,” Journal of Applied Psychology, 33: 527-533, 1949.


422 MEASUREMENT IN INSTRUCTION 


bibliography for 1949-52, see Rothney and Heimann.²² A severe limitation
of projective techniques for the teacher or administrator is that, not being 
tests in the usual sense, they require for proper administration and inter- 
pretation far more clinical training than the nonspecialist can hope to ac- 
quire. Especially in this area a little knowledge can be a mighty dangerous 
thing. 

Novel tests and items. Some of the newer tests of “general mental
ability” are mentioned by Stanley.²³ These include the multi-score Wechsler
Intelligence Scale for Children (WISC),²⁴ an individual test which yields
two separate IQ’s, performance and verbal; the Arthur Adaptation of the
Leiter International Performance Scale,²⁵ an untimed test given without
verbal instructions which should be useful for testing young children with
physical and linguistic handicaps; the Northwestern Intelligence Tests,
developmental scales for infants 4-36 weeks of age that yield IQ’s;²⁶ the
Davis-Eells Games for Grades I-VI, meant to be “fair” to children from
all socio-economic levels;²⁷ and Goossen’s ingeniously disguised six-item
intelligence test for public-opinion pollsters,²⁸ which masquerades as an
interview measure of knowledge of current events.

A widespread recent effort, energized by the Progressive Education
Association’s eight-year study,²⁹ has been to measure understanding rather
than merely memorization. The Forty-Fifth Yearbook of the National Society
for the Study of Education, Part I, entitled “The Measurement of
Understanding,”³⁰ represents a systematic attempt to outline principles helpful
in constructing tests that tap this “higher” type of knowledge. Enough has
already been done with such instruments as the PEA Interpretation of
Data Test, the Cooperative English Test C (“Reading Comprehension”),³¹
the Tests of General Educational Development (GED),³² and the Watson-


22 John W. M. Rothney and Robert A. Heimann, Review of Educational Research,
23: 70-84, February, 1953.

23 Julian C. Stanley, “Development and Applications of Tests of General Mental
Ability,” Review of Educational Research, 23: 11-32, February, 1953.

24 Devised by David Wechsler and published by the Psychological Corporation.

25 Devised by Grace Arthur and published by the Psychological Service Center Press.

26 Devised by Adam R. Gilliland and published by Houghton Mifflin Company. The
use of IQ’s for children less than four or five years of age is open to serious question,
however.

27 Published by World Book Company. For mention of the studies underlying these
tests turn back to page 278.

28 Carl V. Goossen, “The Goossen Hidden Intelligence Test,” Public Opinion Quarterly,
14: 759-766, Winter, 1950.

29 Eugene R. Smith, Ralph W. Tyler, and staff, Appraising and Recording Student
Progress, 550 pages. New York: Harper and Brothers, 1942.

30 Chicago: University of Chicago Press, 1946. 338 pages.

31 Devised by Frederick B. Davis, Harold V. King, and Mary Willis, and published
by the Cooperative Test Division of Educational Testing Service.

32 Prepared by the Examinations Staff of the United States Armed Forces Institute
and distributed by the Cooperative Test Division of Educational Testing Service.




Glaser Critical Thinking Appraisal³³ to indicate clearly that a frightened
retreat to the essay test is not the only way to measure understanding, or, 
indeed, the most desirable one. When properly constructed, objective-type 
items measure much more than merely recognition. For many purposes 
now served by essay and completion tests, objective-type tests would be 
more suitable if prepared in accordance with well-known measurement
principles. This is not to deny that, as often constructed, objective items— 
especially true-false ones—may measure little of importance. 

Information and decision-making. Cronbach³⁴ and others are exploring
the implications for measurement of the recently developed theory
of decision making.³⁵ According to their viewpoint, a test should enable
one to make better decisions than he can make without the test—not just 
better than chance alone, since rarely do we make decisions by sheer 
chance. The educator is concerned with acquiring additional information
upon which to base his many decisions. Does Mary need extra help with 
reading? Should Jean take chemistry? Shall I concentrate upon helping 
Joe adjust better to the group, or is he already doing well enough? 

A crucial aspect of making an accurate decision centers around how much
information one already has and how much more a certain test or tests
can contribute. For instance, should the teacher administer an intelligence
test to his class? Hubbard and Flesher³⁶ found the average correlation
between teachers’ estimates of pupils’ intelligence and the pupils’
intelligence-test scores to be .72. Hanna’s³⁷ interview estimates of intelligence
correlated .71 with scores on the ACE Psychological Examination
and .66 with the Ohio State University Psychological Test, while the
ACEPE and the OSUPT correlated .77. Thus, if the teacher has plenty of
testing time available, he can expect by using an intelligence test to gain
some information concerning this aspect of his pupils that he does not
already have and perhaps cannot acquire easily otherwise, but the increment
may not be large. If, on the other hand, test time is limited (as it
usually is), he may want to administer a test of some other significant
characteristic, even though it be less reliable and less valid than the
intelligence test.


33 Devised by Goodwin Watson and Edward Maynard Glaser, and published by
World Book Company.

34 Lee J. Cronbach, A Consideration of Information Theory and Utility Theory as Tools
for Psychometric Problems, 65 pages. Urbana: Bureau of Research and Service, College
of Education, University of Illinois, November, 1953.

35 Ward Edwards, “The Theory of Decision Making,” Psychological Bulletin,
51: 380-417, July, 1954. 209 references.

36 Robert E. Hubbard and William R. Flesher, “Intelligent Teachers and Intelligence
Tests—Do They Agree?” Educational Research Bulletin, 32: 113-122, 139-140, May 13,
1953.

37 Joseph V. Hanna, “Estimating Intelligence by Interview,” Educational and Psychological
Measurement, 10: 420-430, Autumn, 1950.




Suppose, for example, that the correlation between the teacher's esti- 
mates of his pupils’ mental health and the best available criterion of mental 
health is only .05, while the mental health test correlates .30 with this 
criterion. Then, even though the test has what is usually interpreted as 
low validity, still it contributes to the teacher’s very meager initial in- 
formation concerning the mental health of his students. Therefore, he 
would appear to be well advised to give the mental health test in lieu of 
the intelligence test if both compete for the same time, particularly when 
important decisions having to do with the mental health of the pupils 
must be made. This approach stresses two considerations, the accuracy 
of the judgment that can be made without the test and the importance 
of the area tested. 
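
The comparison in this example can be made concrete with the standard two-predictor multiple-correlation formula. That formula is not given in the text and is brought in here only as an illustrative sketch. The validities .05 and .30 come from the paragraph above; the correlation between the teacher's estimates and the test scores is not reported, so the value used for it below is purely an assumption.

```python
import math


def multiple_r(r_first: float, r_second: float, r_between: float) -> float:
    """Correlation of a criterion with the best weighted sum of two predictors.

    r_first, r_second: each predictor's correlation with the criterion.
    r_between: correlation between the two predictors themselves (must not be +/-1).
    """
    numerator = r_first**2 + r_second**2 - 2 * r_first * r_second * r_between
    return math.sqrt(numerator / (1 - r_between**2))


teacher_validity = 0.05  # teacher's estimates vs. criterion (from the text)
test_validity = 0.30     # mental health test vs. criterion (from the text)
assumed_overlap = 0.0    # ASSUMPTION: estimates and test scores are uncorrelated

combined_validity = multiple_r(teacher_validity, test_validity, assumed_overlap)
gain_over_teacher = combined_validity - teacher_validity

print(f"teacher alone:     {teacher_validity:.2f}")
print(f"teacher plus test: {combined_validity:.2f}")
print(f"gain:              {gain_over_teacher:.2f}")
```

Under that assumption nearly all of the combined validity comes from the test, which is the point of the example: a test of modest validity can still add a good deal when the judgment it supplements is almost uninformed.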

Thus the benefit from a test is not a function only of the test itself, but 
also of the decisions to be made with it. The test is just one step toward 
the goal of efficient decision making. In the classroom situation decisions 
can be changed as further information is acquired. Viewed from this stand- 
point, deciding tentatively on the basis of prior evidence and a low score 
on a mental health test that Bill is having adjustment difficulties does not 
classify him irrevocably as maladjusted. With its validity coefficient of 
only .30, the test will yield quite a few “false negatives,” persons who 
score low on it but are not poorly adjusted. These will be discovered by 
the alert teacher during further screening, when he works with all low 
scorers more closely than heretofore. 

It is important to measure a variety of characteristics, even though some- 
what inaccurately, to know your risk, and to follow through with subse- 
quent checks. Interviews, essay tests, and projective tests are not rifles 
aimed at a narrow target; rather, they are sawed-off shotguns spraying 
rather wildly but frequently hitting the mark, while at the same time 
nicking some innocent bystanders.

It is too early to tell how much this promising-appearing application of 
utility theory will contribute to measurement. The interested reader may 
follow developments in the journal literature by means of subject and 
author indexes in Psychological Abstracts. 


SELECTED REFERENCES FOR FURTHER READING


Coombs, Clyde H., A Theory of Psychological Scaling. Ann Arbor: Engineering
Research Bulletin No. 34, University of Michigan, May, 1952. 94 pages.

Cureton, Edward E., “The Principal Compulsions of Factor Analysts,” Test Service
Notebook No. 4, World Book Company, (1949).

Doppelt, Jerome E., “The Correction for Guessing,” pp. 1-4 in Test Service Bulletin
No. 46, The Psychological Corporation, January, 1954. Free.

Jahoda, Marie; Deutsch, Morton; and Cook, Stuart W., Research Methods in Social
Relations, with Especial Reference to Prejudice. New York: The Dryden Press,
1951. Parts One and Two.

Lord, Frederic, “A Theory of Test Scores,” Psychometric Monograph No. 7, 1952.
84 pages.

“Technical Recommendations for Psychological Tests and Diagnostic Techniques,”
Psychological Bulletin, 51: 1-38, March, 1954.


Appendices 


APPENDIX 


A 


Fifty Questions to Help You 
Learn Statistics 


The following multiple-choice questions are designed to help you improve your 
knowledge of the material in Chapter 3. Each has five options, only one of which is 
meant to be correct. Consult your book freely while answering them. Rather than 
write in your book, copy the question numbers on a separate sheet of paper and 
put after each number the letter preceding the correct option (A, B, C, D, or E). 

Unless you work on the questions diligently, they will probably not increase 
your understanding very much. After completing them all as well as you possibly 
can, turn to pages 450-463, where answers and explanations appear. Your percentage
score, corrected for chance, equals twice the number right minus one-half
of the number wrong:

$$100\left(\frac{R - \frac{W}{4}}{50}\right) = 2R - \tfrac{1}{2}W.$$

An explanation of the $R - \frac{W}{4}$ formula appears on pages 156-158.
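
The scoring rule just stated can be sketched in a few lines. The function name and its default arguments are illustrative assumptions; only the rule itself (fifty items, five options, right minus one-fourth of wrong, expressed as a percentage) comes from the text.

```python
def corrected_percentage(right: int, wrong: int, items: int = 50, options: int = 5) -> float:
    """Percentage score corrected for chance on a multiple-choice test.

    Omitted items count as neither right nor wrong.
    """
    corrected_right = right - wrong / (options - 1)  # R - W/4 for five options
    return 100 * corrected_right / items


# Hypothetical student: 40 right, 8 wrong, 2 omitted.
score = corrected_percentage(right=40, wrong=8)
shortcut = 2 * 40 - 8 / 2  # the 2R - W/2 form given in the text
print(score, shortcut)
```

For the fifty-item, five-option case both forms give the same figure, so the shortcut may be used for hand scoring.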


Test Scores¹     f      d     fd    fd²
310-319          1      5      5     25
300-309          2      4      8     32
290-299          4      3     12     36
280-289          1      2      2      4
270-279          6      1      6      6
260-269         12      0      0      0
250-259         11     -1    -11     11
240-249          8     -2    -16     32
230-239          2     -3     -6     18
220-229          0     -4      0      0
210-219          3     -5    -15     75
             N = 50       Σfd = -15   Σfd² = 239


1 This frequency distribution, illustrating the “Calculation of SD by Use of a Guessed
Mean,” is Table 5 on page 22 of Quinn McNemar’s Psychological Statistics
(New York: John Wiley & Sons, Inc., 1949).
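
As a check on hand work, the guessed-mean computation that this table illustrates can be sketched as follows. The frequencies and step deviations are copied from the table; taking the midpoint of the 260-269 class as the guessed mean and 10 as the class interval are assumptions read off the table rather than statements from the text.

```python
import math

# Frequencies and step deviations, listed from the highest class to the lowest.
frequencies = [1, 2, 4, 1, 6, 12, 11, 8, 2, 0, 3]
deviations = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

class_interval = 10   # assumed from the class limits (e.g. 260-269)
guessed_mean = 264.5  # assumed: midpoint of the class where d = 0

n = sum(frequencies)
sum_fd = sum(f * d for f, d in zip(frequencies, deviations))
sum_fd2 = sum(f * d * d for f, d in zip(frequencies, deviations))

# Correct the guessed mean by the average step deviation, in score units.
mean = guessed_mean + class_interval * sum_fd / n

# Standard deviation in score units from the step-deviation sums.
sd = class_interval * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

print(n, sum_fd, sum_fd2)  # should match N, Σfd, and Σfd² in the table
print(mean, sd)
```

The three sums printed first should agree with the bottom row of the table; if they do not, a frequency has been miscopied.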



The First 14 of the Following Questions Refer to the Above Test Scores


1. If Σ means “the sum of,” then Σf =

A. -15
B. 1
C. 50
D. 239
E. 315

2. The size of the interval of each class in the above distribution is

A. 4.5
B. 5.0
C. 9.0
D. 10.0
E. 10.5

3. The fractional class limits of the highest class are

A. 309.50-319.50
B. 309.50-318.50
C. 309.95-319.95
D. 310.50-318.50
E. 310.50-319.50

4. The midpoint of the middle class (260-269) is

A. 259.5
B. 264.5
C. 265.0
D. 269.0
E. 269.5


5. The arithmetic mean, M, is

A. 260.3
B. 261.5
C. 263.0
D. 264.5
E. 267.5


6. The median (50th percentile) is

A. 258.7
B. 259.5
C. 260.3
D. 264.5
E. 267.3

7. The mode is

A. 12.0
B. 259.5
C. 260.0
D. 264.5
E. 265.0


8. The 25th percentile (Q₁) is

A. 230.1
B. 240.4
C. 244.5
D. 248.9
E. 249.5

9. The 75th percentile (Q₃) is

A. 267.0
B. 269.8
C. 270.2
D. 272.0
E. 274.5

10. Q, the semi-interquartile range, $\frac{Q_3 - Q_1}{2}$, is

A. 8
B. 12
C. 17
D. 20
E. 23

11. The 10th percentile is

A. 245.8
B. 240.0
C. 239.5
D. 239.0
E. 234.5

12. The 90th percentile is

A. 299.5
B. 299.0
C. 294.5
D. 277.8
E. 276.1

13. 0.4D = 0.4 × (90th percentile - 10th percentile) =

A. 20
B. 22
C. 25
D. 31
E. 55


14. Standard deviation = $\frac{i}{N}\sqrt{N\sum fd^2 - \left(\sum fd\right)^2}$ =

A. 2
B. 15
C. 17
D. 20
E. 22


15. In a frequency distribution, the size of the interval of the class whose
lower and upper real limits are 9.5 and 19.5 is

A. 11.0
B. 10.0
C. 9.0
D. 5.0
E. 4.5

16. In a frequency distribution, the midpoint of the class whose lower
and upper real limits are 99.5 and 109.5 is

A. 107.0
B. 105.0
C. 104.5
D. 102.5
E. 102.0


17. The main reason for grouping data in class intervals as a step toward
carrying out calculations of statistical measures by hand (that is,
without using a mechanical calculator) is to

A. reduce the amount of labor involved.
B. reduce the frequency of clerical errors.
C. permit the calculation of measures other than the mean.
D. bring out important trends in the data.
E. hide the identity of the persons tested.

18. In making a frequency distribution from raw data for computational
purposes, the first step is to

A. determine the range of scores.
B. determine the whole-number limits of the classes.
C. determine the real limits of the classes.
D. decide upon the number of classes.
E. select the class interval.


19. What is the most serious criticism to be made of the following
frequency distribution of test scores, where the “real” class limits are
fractional?

Whole-Number
Class Limits    Frequency
44-48               1
40-44               2
36-40               0
32-36               0
28-32               2
24-28               6
20-24               5
16-20              23
12-16              24
 8-12              37
 4- 8              33
 0- 4              25
-4- 0               3
              N = 161

A. Negative scores occur.
B. Thirteen classes are used.
C. There are too many low scores.
D. The class midpoints are not divisible by the range of scores in
each interval.
E. The whole-number class limits overlap.


20. The 60th percentile is the point in
a distribution


A. where a student has answered 
40 per cent of the questions in- 
correctly. 

B. which marks the distance from 
the median that includes 60 per 
cent of the cases. 

C. below which are 40 per cent of 
the cases. 

D. below which are 60 per cent 
of the cases. 

E. above which are 60 per cent of 
the cases. 




21. The midscore of the following scores (4, 6, 7, 5, 4) is

A. 6.0
B. 5.5
C. 5.2
D. 5.0
E. 4.8


22. Given seven scores: 40, 68, 44, 46, 46, 63, and 68. How many points
difference is there between the midscore and the median when the
median is computed from a frequency distribution of these scores
where the class interval is 1?


A. 17.00 
B. 0.67 
C093 
D. 0.17 
E. 0.00 


23. The arithmetic mean of the following scores (4, 5, 7, 6, 4) is

A. 6.0
B. 5.5
C. 5.2
D. 5.0
E. 4.8


24. The measure of central tendency to use when reporting data concerning
wages in order to avoid the undue influence of a few extreme salaries is
the

A. standard deviation.
B. quartile deviation.
C. median.
D. range.
E. mean.


25. The term “average” as used in arithmetic textbooks refers to

A. variability.
B. the mode.
C. central tendency.
D. the median.
E. the arithmetic mean.


26. From the standpoint of statistics, the term that means the same thing
as “average” is

A. normal.
B. median.
C. mode.
D. central tendency.
E. mean.


27. What is the arithmetic mean of the following distribution?

Score    f    d    fd
0-3      7    0
4-7      3
     N = 10

A. 3.5
B. 3.2
C. 2.7
D. 2.4
E. 2.2


28. D = (P₉₀ - P₁₀). This is a measure of

A. variability.
B. correlation.
C. central tendency.
D. averageness.
E. modality.

29. The percentage of scores lying between Q₁ and the median is

A. 25
B. 34
C. 50
D. 68
E. a variable quantity that depends upon the score distribution.


30. What is the standard deviation of
the following distribution?


Deor а fg. of 
4 3 
2 4 
0 3 




31. For the following distribution, the quartile deviation or
semi-interquartile range, Q, is

Score    f
8        1
6        2
5        4
3        4
2        3
0        2
     N = 16

A. 2.17
B. 2.08
C. 1.79
D. 1.54
E. 1.04


32. For the distribution in the preceding item, the median is

A. 1.75
B. 3.00
C. 3.25
D. 3.62
E. 4.00


33. On a test with a standard deviation of 20 and an arithmetic mean of 80,
an individual with a raw score of 70 will have a z-score of

A. -10.0
B. -0.5
C. -0.1
D. 0.5
E. 5.0


34. What rank should be assigned to a
score of 95 in the following distribution,
if the rank of the lowest
score, 93, is 9?


Score- 




C. 7.0 
D. 4.0 
E. 5.0 


35. How does the mean of N consecutive
untied ranks compare with the
mean of N ranks in which one or 
more ties occur? 


A. Former is larger. 

B. No difference. 

C. Latter is larger. 

D. Depends upon the number of 
ties. 

E. Depends upon where the ties
occur.


36. The Pearson product-moment coefficient of correlation, r_xy or r_yx,
may vary between

A. -2.00 and +2.00
B. -1.00 and +1.00
C. -0.92 and +0.92
D. 0.00 and +1.00
E. 0.00 and infinity.


37. A teacher computed a correlation
coefficient between scores on a read- 
ing test and scores on the Coopera- 
tive Test of Contemporary Affairs, 
obtaining a value of .92. She was 
justified in concluding that, as 
measured by these two tests, 


A. knowledge of current affairs and 
reading ability are closely re- 
lated. 

B. knowledge of current affairs and 
reading ability are unrelated to 
each other. 

C. knowledge of current affairs and 
reading ability are perfectly re- 
lated, 


D. the coefficient must have been
computed incorrectly.

E. wide knowledge of current af- 
fairs is the result of good reading 
ability. 


38. Which one of these r's has the least
predictive value?

A. .91
В. .50 
C. ales 
D. -.23
E. —1.00 


39. A student computed a Pearson
product-moment coefficient of correlation,
r_xy, between paired scores
in two distributions, X and Y, and
found it to be 1.05. We are absolutely
certain that

A. he has freakish data. 

B. he should have computed Spear- 
man's rank-difference coefficient 
of correlation, rho, instead. 

C. the means of the two distribu- 
tions differ. 

D. se correlation between X and Y 


E. the r has been computed in- 
correctly. 


40. If the X distribution is divided into 12 classes and the Y distribution is
also divided into 12 classes, the number of tally marks in the scatter
diagram will be

A. N
B. 2N
C. 12
D. 24
E. 144


Multiple-Choice Analogies (41-50) 


Directions: Each of the following ten 
items represents an analogy. In every 
case the first two terms of the item are 
related to each other in some way, and 
the third term is related in the same way 
to one of the last five. 


Example: Shoe is to foot as hat is to 


thead,” H 


41. Arithmetic mean is to central tendency
as standard deviation is to


T D. percentile. 
B. 75th percentile, 
C. 50th percentile, 
D. 16th percentile, 
E. 10th percentile. 


. Arithmetic mean is to д as med 


is to 


44. Frequency distribution is to median
as ungrouped measures arranged in
order of magnitude are to


A. range. 

B. mode. 

C. class interval. 
D. mean. 

E. midscore. 


45. Q₃ - Q₁ is to 50% as, for a “normal”
distribution, Mean ± 1 SD is to


A. 32% 
B. 34% 
C. 50% 
D. 68% 
E. 84% 


Arithmetic mean is to mode as 
to 


Median is to point as erm de- 
viation is to 


A. volume. 
B. distance. 
C. square. 

D. score. 


48. Positive correlation is to direct as negative correlation is to

A. incomplete.
B. inconsequential.
C. incorrect.
D. inadequate.
E. inverse.

49. Spearman is to ρ as Pearson is to

A. r

50. Rank is to order as score is to

disorder.
E. variability.


APPENDIX 


B 


A Simplified Item-Analysis Procedure 


A. Preparing the Items 


The two characteristics usually determined for a test item are difficulty and
discrimination. How hard is the item for the group tested, and how well does it
distinguish between the more able and the less able students? These two aspects
of an item are nearly independent of each other, the exception being that a very
easy or very hard item cannot discriminate well. If all testees mark the item correctly,
it has not separated the testees into two groups, the passers and the failers.
Likewise, if all mark it incorrectly (or if only a chance proportion of examinees
mark it correctly), the item is non-discriminating for the group.

In the following paragraphs, a simple method for analyzing items is presented 
and illustrated in considerable detail. Preferably, the test or subtest should contain 
items of only one type (for example, four-option multiple-choice). There should be 
a considerable number of such items—say, arbitrarily, 50 or more—and they should 
have been administered to a substantial number of persons. Thus the procedure 
works best with final examinations prepared cooperatively by several teachers and 
given to large groups, and less well with daily or weekly tests in a single school 
class. Even for the latter situation it has some value, however. 

Each testee should be strongly encouraged to answer every item for which he has 
any information whatsoever, even a vague hunch concerning a single one of the 
options.² Also, plenty of time should be available for every examinee to try every
item. 

The items should be arranged as nearly as possible in ascending order of difficulty. 
This can be done fairly well on a subjective basis, or if the item has been ad- 
ministered previously to a similar group, the original difficulty values may be used. 
Subjective estimates of relative difficulty based upon the average of independent 
rankings by three qualified persons usually approximate the actual ranks better 
than an ordering made by only one person. 


1 Before using this appendix, the student will want to read the material on pages
117-119 carefully.

2 See the discussion on pages 153-154.


SIMPLIFIED ITEM-ANALYSIS 437 


Test items should be prepared according to content specifications agreed upon 
by the teachers who will use them. If possible, each item should be typed on a 
5x8 card, with the answer not indicated on the front of the card. In fact, for pur- 
poses of criticism and editing it may be well not to have the answer on the back 
of the card, either. A separate answer key may be better. 

Each item should be constructed with great care, special attention being given 
to the incorrect options (called distracters, decoys, or foils). It should then be keyed 
and criticized on a separate sheet independently by each person helping to devise 
the test. This editing is extremely important and should be followed by a detailed 
conference to reconcile differences, remove ambiguities, and discard items that can- 
not be revised properly. Though time-consuming, a cooperative approach to test 
construction pays off by increasing reliability and validity and improving the morale 
of test-takers. 

Statistical item analysis is no substitute for meticulous care in planning, constructing,
criticizing, and editing items. It does supplement that intuitive process, however, by 
revealing unsuspected defects or virtues of the specific items. 


B. A Measure of Discrimination 


After the test has been given, score the papers or answer sheets by marking with 
a red pencil all items incorrectly answered or omitted. Because of the instructions 
concerning omissions, they should be few. Each pupil’s score will be the sum of his 
red marks—the smaller the better. 

Divide the papers or answer sheets into three piles, as follows: 


1. Arrange the N papers by score, beginning on top with the smallest number
(best score) and going on down to the largest number (poorest score). 

2. Multiply N, the total number of testees, by 0.27 and round off the result to 
the nearest whole number, or look in Table 46 on pages 448-450 for the appro- 
priate figure, called n there. 

3. Count off the n best papers from the top of the stack. This is the “high” group. 

4. Count off the n poorest papers from the bottom of the stack. This is the “low” 
group. 

5. Put aside the middle group (approximately 46 per cent of the papers), since
it is not used in the item-analysis.

6. Set up a form somewhat like Table 42, with Item Number, W_L, W_H,
W_L - W_H, and W_L + W_H headings.

7. W_L is the number of persons in the low group who answered a certain item
wrongly, including those who omitted it. It represents the total number of red
marks for that item in the low group.

8. W_H is the number of persons in the high group who answered the item wrongly,
including those who omitted it. It represents the total number of red marks for
that item in the high group.

9. W_L - W_H means “W_L minus W_H” for a given item. W_L + W_H means “W_L
plus W_H” for that item.


The larger W_L - W_H is, the more discriminating power the item has. For editing
purposes it is well to arrange the items from least discriminating—and therefore
in greatest need of scrutiny—to most discriminating. A few items may have negative
W_L - W_H values, indicating that more persons in the high than in the low
group missed the item. Such items may be mis-keyed, ambiguous, or unrelated in
content to the rest of the test.

For convenience, a critical value of W_L - W_H at or above which the item is
considered suitably discriminating may be determined from Table 46 on page 448.
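
The counting steps above can be sketched in code for readers who keep scores in machine-readable form. This is only an illustration under stated assumptions: each paper is represented as the set of item numbers carrying a red mark, the function names are invented for the sketch, and 0.27N is rounded to the nearest whole number rather than looked up in Table 46.

```python
from collections import Counter


def group_size(n_testees: int) -> int:
    """Number of papers in each extreme group: 0.27 N, rounded to nearest whole."""
    return int(0.27 * n_testees + 0.5)


def item_discrimination(papers: list, n_items: int) -> dict:
    """Return {item: (W_L, W_H, W_L - W_H, W_L + W_H)} for items 1..n_items.

    papers: one set per testee holding the item numbers marked wrong or omitted.
    """
    n = group_size(len(papers))
    if n == 0:
        raise ValueError("too few papers to form high and low groups")
    ranked = sorted(papers, key=len)          # fewest red marks (best) first
    high, low = ranked[:n], ranked[-n:]       # the middle group is set aside
    w_high = Counter(item for paper in high for item in paper)
    w_low = Counter(item for paper in low for item in paper)
    return {
        item: (w_low[item], w_high[item],
               w_low[item] - w_high[item], w_low[item] + w_high[item])
        for item in range(1, n_items + 1)
    }


# Hypothetical four-paper, three-item example (not data from the text).
example_papers = [{1}, {1, 2}, {2, 3}, {1, 2, 3}]
stats = item_discrimination(example_papers, n_items=3)
for item, row in sorted(stats.items(), key=lambda pair: pair[1][2]):
    print(item, row)  # least discriminating items are listed first
```

Sorting by the third entry reproduces the arrangement recommended for editing, with the items most in need of scrutiny at the top.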


TABLE 42

THE 100 ITEMS IN A FIVE-OPTION MULTIPLE-CHOICE TEACHER-MADE TEST
ARRANGED ACCORDING TO DISCRIMINATING POWER, FROM THE
LEAST DISCRIMINATING TO THE MOST DISCRIMINATING

Columns: Item Number; Rank Order of Item According to Discriminating Power
(1 = Poorest Discrimination); W_L; W_H; W_L - W_H (Discrimination); W_L + W_H;
Estimated Percentage of Examinees Who Did Not “Know” the Correct Answer to
the Item* (Difficulty).

Item   Rank   W_L   W_H   W_L - W_H   W_L + W_H   Difficulty
 30      1     31    40      -9           71          67
 35      2      1     2      -1            3           3
 27      3     38    39      -1           77          73
 38      4      0     0       0            0           0
 31      5     38    37       1           75          71
 34      6     28    26       2           54          51
 42      7      4     1       3            5           5
 72      8     17    14       3           31          29
 32      9      5     0       5            5           5
 60     10      5     0       5            5           5
 29     11      9     4       5           13          12
 39     12     14     9       5           23          22
 94     13      6     0       6            6           6
 45     14      8     1       7            9           9
 28     15     49    42       7           91          86
  8     16     58    51       7          109         103**
 33     17     31    23       8           54          51
 44     18     10     1       9           11          10
 51     19     10     1       9           11          10
 86     20     10     1       9           11          10
 92     21     16     7       9           23          22
 81     22     13     2      11           15          14
 14     23     16     5      11           21          20
 74     24     24    13      11           37          35
 57     25     26    15      11           41          39
 46     26     12     0      12           12          11
 40     27     13     1      12           14          13
 48     28     15     3      12           18          17
 67     29     15     2      13           17          16
 52     30     30    17      13           47          45
 20     31     48    35      13           83          79
 21     32     54    41      13           95          90
 76     33     15     1      14           16          15
 68     34     16     2      14           18          17
 59     35     17     3      14           20          19
 19     36     22     8      14           30          28
 62     37     18     3      15           21          20
 70     38     21     6      15           27          26
 61     39     28    13      15           41          39
 43     40     37    22      15           59          56

* $\frac{100\,O}{2n(O - 1)}(W_L + W_H) = \frac{500}{132 \times 4}(W_L + W_H) = 0.947(W_L + W_H)$

** Slightly fewer persons answered Item No. 8 correctly than would be expected on
the basis of chance alone.


TABLE 42 (Continued)

Rank   W_L   W_H   W_L - W_H   W_L + W_H
 41     37    22      15           59
 42     43    28      15           71
 43     57    42      15           99
 44     16     0      16           16
 45     18     2      16           20
 46     18     2      16           20
 47     23     7      16           30
 48     18     1      17           19
 49     18     1      17           19
 50     19     2      17           21
 51     19     1      18           20
 52     23     5      18           28
 53     29    11      18           40
 54     25     6      19           31
 55     27     8      19           35
 56     39    20      19           59
 57     46    27      19           73
 58     51    32      19           83
 59     23     3      20           26
 60     25     5      20           30
 61     26     6      20           32
 62     45    25      20           70
 63     25     4      21           29
 64     35    14      21           49
 65     59    38      21           97
 66     53    31      22           84
 67     26     3      23           29
 68     31     8      23           39
 69     37    14      23           51
 70     26     2      24           28
 71     51    27      24           78
 72     30     5      25           35
 73     33     8      25           41
 74     33     8      25           41
 75     35    10      25           45
 76     38    13      25           51
 77     40    15      25           55
 78     27     1      26           28
 79     37    11      26           48
 80     42    16      26           58
 81     47    21      26           68
 82     30     3      27           33
 83     32     5      27           37
 84     36     9      27           45
 85     37    10      27           47
 86     42    15      27           57
 87     29     1      28           30


440 APPENDICES 


TABLE 42 (Continued)

Item Number | Rank Order of Item According to Discriminating Power (1 = Poorest Discrimination) | W_L | W_H | W_L − W_H (Discrimination) | W_L + W_H | Estimated Percentage of Examinees Who Did Not "Know" the Correct Answer to the Item (Difficulty)

 6   88  39  11  28  50  47
96   89  36   7  29  43  41
50   90  40  11  29  51  48
85   91  52  23  29  75  71
73   92  32   2  30  34  32
99   93  43  13  30  56  53
87   94  38   7  31  45  43
93   95  54  23  31  77  73
15   96  37   5  32  42  40
88   97  37   2  35  39  37
 2   98  43   8  35  51  48
12   99  41   4  37  45  43
90  100  49   8  41  57  54


"Then high-low group data for every option of each unsuitably discriminating item 
may be secured to aid in the editing process. 


C. A Measure of Difficulty 
The larger W_L + W_H is, the harder the item was for the group tested. W_L + W_H
may be multiplied by a constant, $\frac{100\,O}{2n(O-1)}$, to obtain an estimate of the
difficulty of the item, corrected for chance;² here O is the number of options each item
has. This approximates the percentage of the testees who did not "know" the
correct answer. Items in the revised test should be arranged according to W_L + W_H,
from lowest (easiest) to highest (hardest).
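
The difficulty estimate described in this section can be sketched in a few lines of Python. This is only an illustration: the function name `difficulty_percent` and its parameter names are assumptions made for the sketch, not part of the original procedure. The sample figures are those of Item 30 in Table 42 (W_L = 31, W_H = 40, n = 66, five options).

```python
def difficulty_percent(w_low, w_high, n, options):
    """Estimated percentage of examinees who did not 'know' the answer.

    w_low, w_high: wrong answers (including omits) in the low and high groups
    n: number of examinees in each group
    options: number of options per item (O in the text)
    """
    # Constant 100 * O / (2n(O - 1)), applied to W_L + W_H.
    constant = 100 * options / (2 * n * (options - 1))
    return constant * (w_low + w_high)


# Item 30 of Table 42: W_L = 31, W_H = 40, n = 66, five options.
item_30 = difficulty_percent(31, 40, 66, 5)
print(round(item_30))
```

With O = 5 and n = 66 the constant works out to about 0.947, the multiplier quoted in the footnote to Table 42.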


D. An Illustrative Analysis 


Table 42 shows W_L, W_H, W_L − W_H, W_L + W_H, and difficulty values for each
of the 100 items on an English final examination constructed by four college
instructors and administered with "do-guess" instructions to 243 college freshmen
at the end of the winter quarter. The items have been rearranged, the least
discriminating item now coming first and the most discriminating one last.

When N = 243, 0.27N = n = 66, so there are 66 persons in the low group and
66 in the high group. Therefore, the maximum possible value of W_L − W_H is
66 − 0 = 66, and the minimum possible value is 0 − 66 = −66. These figures
would probably never occur except because of mis-keying, however, for even by
chance about 1/5th of the examinees who attempted the item would mark it correctly.


² Quite a few test experts do not favor correcting item difficulty indexes for "chance."
For a discussion of this point see Frederick B. Davis, "Item Selection Techniques,"
Chapter 9 in E. F. Lindquist (Editor), Educational Measurement, pages 267-285.
Washington, D. C.: American Council on Education, 1951.


SIMPLIFIED ITEM-ANALYSIS 441 


Thus the highest value for W_L − W_H, barring omissions, mis-keying, or extreme
misinformation, is [66 − (1/5)(66)] − 0 = 52.8; the lowest is −52.8.
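
A minimal sketch of the arithmetic in this passage follows. The helper names `group_size` and `chance_bounded_range` are hypothetical labels chosen for the illustration; the 27 per cent fraction, N = 243, and five options come from the text.

```python
def group_size(total_tested, fraction=0.27):
    """Number of examinees in the low (or high) group, rounded to a whole number."""
    return round(total_tested * fraction)


def chance_bounded_range(n, options):
    """Lowest and highest W_L - W_H expected when one group knows nothing
    and guesses, so that about 1/options of its members answer correctly,
    while the other group makes no errors."""
    highest = (n - n / options) - 0
    return -highest, highest


n = group_size(243)
low, high = chance_bounded_range(n, options=5)
print(n, low, high)
```

The absolute limits of −n and +n remain possible in principle, but, as the text notes, they would ordinarily signal a mis-keyed item.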

In practice, there will probably be few large negative values, since most items have
at least a little positive discriminating power. Only three negative W_L − W_H figures
occur in Table 42; the largest of these is −9. The greatest positive W_L − W_H
value in the table is 41. It will be instructive to examine these two items (Numbers
30 and 90) carefully in order to determine why they differ radically in discriminating
power. Let us take the least discriminating item first:


30. In preparing a speech the first step is to choose a subject. The speaker should 
then 


A. practice. 

B. collect material. 
C. choose gestures. 
D. select main points. 
E. phrase a thesis. 


Responses by the high and low groups (66 testees in each) were as shown in 
Table 43. The keyed answer was B, “collect material,” but E, “phrase a thesis,” 
appealed more to those students who earned high scores on the test as a whole. 
Options A and C ("practice" and "choose gestures") were practically useless, since
they decoyed only 3 of the 132 persons. Option D, "select main points," discrimi- 
nated in the proper direction, 8 to 18. The item as a whole was fairly difficult, since 
67 per cent of the freshmen did not “know” the correct answer. 


TABLE 43
NUMBER OF EXAMINEES IN HIGH AND LOW GROUPS WHO CHOSE EACH OPTION OF
ITEM NO. 30

Group | Option: A | B | C | D | E | Omit | Number of Examinees

High     1  26   0   8  31   0   66
Low      1  35   1  18  11   0   66
Totals   2  61   1  26  42   0  132


By using the above information, the speech teacher may be able to salvage the 
item without destroying its main point. He would try to determine why Option E 
attracted the better students. This may indicate the need for additional classroom
instruction concerning steps in preparing a speech, or it may highlight a real con- 
flict between B and E as the correct answer for the item. If the dilemma can be 
resolved, new distracters will then be devised to replace ineffective Options A and C. 
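
The kind of option-by-option reading applied to Table 43 can be sketched as follows. The counts are those of Table 43; the function names and the cutoff of 5 choosers for calling a distracter "nonfunctioning" are assumptions made only for this illustration, not rules stated in the text.

```python
OPTIONS = ["A", "B", "C", "D", "E", "Omit"]

# Table 43: responses to Item 30 (keyed answer B), 66 examinees per group.
high = dict(zip(OPTIONS, [1, 26, 0, 8, 31, 0]))
low = dict(zip(OPTIONS, [1, 35, 1, 18, 11, 0]))


def wrongs(counts, key):
    """Number who missed or omitted the item (W for one group)."""
    return sum(count for option, count in counts.items() if option != key)


def review_distracters(high, low, key, min_takers=5):
    """Label each distracter; min_takers is an assumed, arbitrary cutoff."""
    notes = {}
    for option in high:
        if option in (key, "Omit"):
            continue
        if high[option] + low[option] < min_takers:
            notes[option] = "nonfunctioning"
        elif high[option] > low[option]:
            notes[option] = "attracts high group"
        else:
            notes[option] = "discriminates properly"
    return notes


w_low, w_high = wrongs(low, "B"), wrongs(high, "B")
print(w_low, w_high, w_low - w_high)
print(review_distracters(high, low, "B"))
```

For Item 30 the sketch flags A and C as nonfunctioning and E as drawing the high group, which matches the reading given in the text.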

On the other hand, it may not be feasible to retain the item. Not all poorly dis- 
criminating questions can be revised successfully. Sometimes the point tested is 
not clear or defensible enough to serve as the basis for an item. Therefore, more 
items for each part of the test outline should be prepared than will be needed in the 
revised test, so that virtually unrevisable items may be discarded. How many 
excess items are needed depends upon the nature of the test, the purposes for which 
it will be used, and the care devoted to initial construction and editing. Some items, 
such as those concerning vocabulary, are much easier to prepare well than are
others, such as civics questions.


442 APPENDICES 


Now let us turn to the most discriminating item, No. 90: 


“Humanity is the mould to break away from, the crust to break through, the 
coal to break into fire, the atom to be split" is a quotation from 

A. John Dos Passos. 

B. Carl Sandburg. 

C. Robinson Jeffers. 

D. Kenneth Fearing. 

E. Sherwood Anderson. 


Numbers of responses to the various options are shown in Table 44; the keyed
answer is C, "Robinson Jeffers." Note that all four distracters (A, B, D, E)
discriminate in the right direction and reasonably well; each is more attractive to the
low group. Approximately 54 per cent of the testees did not "know" the correct
answer. This item does not need any editing.

TABLE 44
NUMBER OF EXAMINEES IN HIGH AND LOW GROUPS WHO CHOSE EACH OPTION OF
ITEM NO. 90

Group | Option: A | B | C | D | E | Omit | Number of Examinees

High     2   1  58   3   2   0   66
Low     14   8  17  15  11   1   66
Totals  16   9  75  18  13   1  132

How many of the items should be edited on the basis of option information like
that contained in Tables 43 and 44? Probably most of them could be improved in
this manner, especially by the substitution of better distracters for nonfunctioning
ones, but the labor involved in this process is too great for most teachers unless
only a small portion of the items are scrutinized. A rule-of-thumb procedure would
be to edit the 25 per cent least discriminating items. For Table 42, where there are
100 items, this involves taking the first 25 items, whose W_L − W_H values are less
than 12. For these 25 items, information like that contained in Table 43 would be
drawn up by two conscientious persons (even high-school students) working
together. Then the subject matter experts would edit carefully on the basis of the
discrimination, difficulty, and option information given in Tables 42 and 45.

To provide you material upon which to practice editing, the 25 least discrimi- 
nating items are presented herewith. They still contain spelling and typographical 
errors which appeared on the test itself. Of course, these should have been removed 
by careful proofing of the stencils before mimeographed copies were run off. 


30. Already appears on page 441.

35. The subject you choose for a talk should be
A. one that is wholly new to you.
B. one that interests you but about which you know nothing.
C. anything for which you can find material.
D. anything which will find the required time.
E. one that interests you and about which you already know something.


SIMPLIFIED ITEM-ANALYSIS 443 


TABLE 45

NUMBER OF EXAMINEES IN HIGH AND LOW GROUPS WHO CHOSE EACH OPTION OF
THE 25 LEAST DISCRIMINATING ITEMS

Item Number | Group | Option: A | B | C | D | E | Omit | Number of Examinees (Check Column)

30 H 1 26 0 8 31 0 66
L 1 35 1 18 11 0 66
35 H 1 0 Ai 66 
ji 0 0 0 66 
27 H 31 7 0 66 
L 26 8 0 66 
38 H 66 0 0 0 0 0 66
L 66 0 0 0 0 0 66
31 H 29 3 5 0 66 
L 28 5 6 0 66 
34 H 8 18 0 0 66 
L 10 12 0 1 66 
42 H 0 1 0 0 66 
L 0 0 0 0 66 
72 H 5 0 52 8 0 66 
if 9 0 49 3 1 66 
32 H 66 0 0 0 0 0 66 
L 61 3 1 0 0 2 66 
60 H 0 0 66 0 0 0 66
L 3 1 61 1 0 0 66
29 H 3 1 0 62 0 0 66 
L 6 0 1 57 2 0 66 
39 H 0 57 0 2 7 0 66 
L 0 52 5 2 7 0 66
94 H 66 0 0 0 0 0 66
L 60 3 1 1 cire шс 
45 H 65 0 0 0 1 0 66
1, 58 1 2 1 e e s 
zog 2 RN еа: Вазо ke 
7 0 4 1 17 32 2 66 
8 H 13 15 9 21 8 0 66
L 8 8 6 34 10 0 66


444 APPENDICES 


TABLE 45 (Continued)

Item Number | Group | Option: A | B | C | D | E | Omit | Number of Examinees (Check Column)

33  H   1  43  21   0   1   0  66
    L   0  35  25   2   3   1  66
44  H  65   0   1   0   0   0  66
    L  56   1   8   1   0   0  66
51  H   0  65   1   0   0   0  66
    L   2  56   2   5   1   0  66
86  H   0   0   0   0  65   1  66
    L   1   2   2   4  56   1  66
92  H   0   0   1   5  59   1  66
    L   3   0   1  11  50   1  66
81  H   0   0   1   1  64   0  66
    L   0   0   1  12  53   0  66
14  H   1  61   1   3   0   0  66
    L   9  50   2   4   1   0  66
74  H   3  53   2   6   2   0  66
    L   6  42   6   9   2   1  66
57  H   9   0   4  51   2   0  66
    L  11   2   5  40   6   2  66


27. In a panel discussion the members of the panel
A. deliver prepared speeches.
B. ask questions of the audience.
C. provide informal discussion for audience.
D. answers questions of the moderator.
E. speak in rotation from right to left.

38. The best material for a speech
A. holds interest and develops thesis.
B. bores audience but develops thesis.
C. pleases the speaker but annoys the audience.
D. pleases audience but ignores purpose of speech.
E. is unreliable but develops thesis.

31. A speaker whose purpose is to instruct should begin by
A. showing why the information is needed.
B. telling a funny story.
C. stating his thesis.
D. putting a diagram on the board.
E. stating his qualifications to speak.

34. The fundamental process under which is included methods of organizing a
speech is
A. adjustment to the speaking situation.
B. articulation.
C. choice of material.
D. symbolic formulation and expression.
E. phonation.

42. Good posture for a speaker should be
A. rigid and stiff.
B. oratorical and pompous.
C. comfortable and natural.
D. odd and unusual.
E. lax and undisciplined.

72. In "American Letter" MacLeish expresses
A. loyalty to a foreign land.
B. disgust with American industry.
C. disgust with American tradition.
D. loyalty to America.
E. a desire to leave America.

32. In choosing a subject, a speaker should try to find one which
A. he is or can become enthusiastic about.
B. he dislikes but which may please the audience.
C. will annoy his audience.
D. will require little preparation.
E. he has seen in a popular magazine.

60. "Roan Stallion" was written by:
A. Oswald Spengler.
B. Aldous Huxley.
C. Robinson Jeffers.
D. Leonard S. Brown.
E. George Boas.

29. In public discussion, he makes the best contribution who
A. argues down all objections to his proposals.
B. does the most talking.
C. listens attentively but says nothing.
D. makes a creative adjustment between conflicting points of view.
E. refuses to permit the expression of conflicting points of view.

39. The best kind of introduction will
A. make the audience laugh and feel good.
B. get favorable attention and lead into subject.
C. impress the audience with the importance of the speaker.
D. present major arguments to be developed.
E. introduce a need step, thesis, and main points of speech.

94. The main idea in the selection in the text from Steinbeck's The Grapes of Wrath
is to
A. show the conflict between those who own the land and those who care for it.
B. tell the story of a farmer who became a day laborer.
C. describe a family of poverty-stricken children.
D. explain the importance of rotation of crops.
E. urge tenants to pay their taxes.

45. In order to avoid stage fright, perhaps the best precaution is
A. to be thoroughly prepared.
B. to write your speech and read it.
C. to display a "don't care anyhow" attitude to the audience.
D. to display an overconfident, overbearing attitude.
E. to avoid looking directly at your audience.


446 APPENDICES 


28. The round table method of public discussion is suitable for groups of not more
E. 5.

8. One of the following is a run-on sentence. That sentence is:
A. Dinner was served, and we ate rapidly.
B. Some of the people are waiting, others have gone ahead.
C. This is the problem; the solution is clear.
D. He paused, adjusted his tie, and rang the bell; the maid refused to open
the door.
E. Whistles, sirens, horns, and firecrackers broke the silence, and the New
Year was born.

33. Barnes emphasises the Four Fundamental Processes of Speech. Of these the
first is
A. phonation.
B. adjustment to the speaking situation.
C. choice of material.
D. control of bodily activity.
E. projection to the audience.

44. Best advice in developing a lively sense of communication is
A. be energetic, speak with enthusiasm.
B. be passive, apathetic.
C. appear reluctant to meet the assignment.
D. avoid looking directly at the audience.
E. speak with an outburst of oratorical display.

51. Steinbeck's The Grapes of Wrath pictures primarily:
A. the shiftlessness of the average American farm worker.
B. the plight of the Western tenant farmers and migratory workers.
C. the effect of Communist propaganda on poor tenant farmers.
D. the immorality of the ignorant migratory workers.
E. the lack of religious faith among American farmers.

86. A boy whose fascination with a machine later turned to fear was found in
A. Roan Stallion.
B. Tractored Off.
C. R. U. R.
D. Our Changing Characteristics.
E. Mr. Mechano.

92. The Wright Brothers' home was
A. on Albermarle Sound.
B. at Fort Meyers.
C. in St. Petersburg, Florida.
D. on the coast of North Carolina.
E. on Hawthorne Street in Dayton, Ohio.

81. President Truman in his Fordham Address said the one defense against the
atom bomb lies in
A. bigger and better fighter planes.
B. an adequate air raid warning system.
C. aggressive warfare.
D. making a stronger U. N.
E. mastering the science of human relationships.


SIMPLIFIED ITEM-ANALYSIS 447


14. Which sentence is best punctuated? 

A. I have no pencil; and I do not want one.

B. I have no pencil, and I do not want one.

C. I have no pencil: and I do not want one. 

D. I have no pencil—and I do not want one. 

E. I have no pencil: and I do not want one. 
74. Karel Capek, author of "R. U. R." is a
A. Frenchman.
B. Czechoslovakian.
C. American.
D. Pole.
E. Italian.

57. Lippman's Problem of Unbelief seeks primarily to:
A. show how peace of mind can be achieved.
B. denounce the creeds of the leading Protestant denominations.
C. show that Christianity is a dying religion.
D. determine the causes of modern man's lack of religious faith.
E. improve the morals of American youth.


Complete option information for high and low group responses to these 25 poorly 
discriminating items is provided in Table 45, where bold-faced type indicates the 
number of persons marking the keyed option. 

The 20 speech items (Numbers 26-45) make up only 20 per cent of the entire 
test, yet 14 of these (70 per cent) appear among the 25 least discriminating items. 
The first 25 items in the test cover spelling, grammar, and punctuation. Only 2 of 
these (8 per cent) appear in Table 45. The last 55 items deal with literature; 9 of 
these (16 per cent) were poor discriminators. It is obvious, then, that from the 
percentage standpoint about 5 times as many speech items as non-speech ones seem 
unsatisfactory (70 versus 14 per cent).
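
The tally behind these percentages can be reproduced with a short sketch. The item numbers are the 25 least discriminating items listed in the text, and the item ranges for each content area are those stated in the paragraph above; the names `LEAST` and `percent_poor` are hypothetical, introduced only for the illustration.

```python
# Item numbers of the 25 least discriminating items, in rank order.
LEAST = [30, 35, 27, 38, 31, 34, 42, 72, 32, 60, 29, 39, 94,
         45, 28, 8, 33, 44, 51, 86, 92, 81, 14, 74, 57]

mechanics = list(range(1, 26))     # spelling, grammar, punctuation (items 1-25)
speech = list(range(26, 46))       # speech (items 26-45)
literature = list(range(46, 101))  # literature (items 46-100)


def percent_poor(item_numbers):
    """Count and rounded percentage of items in an area that are in LEAST."""
    poor = sum(1 for item in item_numbers if item in LEAST)
    return poor, round(100 * poor / len(item_numbers))


for name, area in [("mechanics", mechanics), ("speech", speech),
                   ("literature", literature),
                   ("non-speech", mechanics + literature)]:
    print(name, percent_poor(area))
```

The speech rate of 70 per cent against a non-speech rate of about 14 per cent gives the ratio of roughly 5 to 1 mentioned in the text.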

There are several possible explanations for the poor showing of the speech items. 
First of all, they make up only a fifth of the test and therefore do not have much 
weight in determining each testee's total score. If speech ability is little related to 
the other components of the test, then no matter how carefully prepared the 


construct, while the more standard phases of English are easier to test. Some of them 
(Numbers 35, 38, 42, 32, 29, 45, and 44) are too easy and cannot discriminate well. 
Only 2 (Numbers 34 and 33) are near the optimal difficulty level of 50 per cent. 

A third possibility, related to the second, is that the speech specialist was a less 
competent item writer than the other three staff members, or that he exercised 
less care than they. All four persons were experienced teachers, but novices in 
constructing objective questions. 


E. A Discrimination Table 

The process of item analysis can be simplified by reference to Table 46. Knowing
the number of persons tested, N, find n, the number of persons in the
high or low group. Then for the number of options each item in the test or subtest
has, find the minimum value of W_L − W_H needed in order to conclude that the
item has significant discriminating power. In the above example, where N was 243,
n is seen to be 66, and the minimum W_L − W_H for


448 APPENDICES 


TABLE 46

TABLE FOR DETERMINING WHETHER OR NOT A GIVEN TEST ITEM DISCRIMINATES
SIGNIFICANTLY BETWEEN A "HIGH" AND A "LOW" GROUP*

(W_L = number of persons in the low group who answered the item incorrectly or
omitted it; W_H = number in the high group who answered the item incorrectly or
omitted it.)

(W_L − W_H) at or above Which an Item Can Be Considered Sufficiently Discriminating

Total Number of Persons Tested (N) | Number in Low or High Group (0.27N) (N_L = N_H = n) | Number of Options: 2 (True-False or Two-Option Multiple Choice) | 3 | 4 | 5
PE 8 4 5 5 5 
32- 35 9 5 5 5 5 
36- 38 10 5 5 5 5 
39- 42 11 5 5 5 5 
43- 46 12 5 5 6 6 
47- 49 13 5 6 6 6 
50- 53 14 5 6 6 6 
54- 57 15 6 6 6 6 
58- 61 16 6 6 6 6 
62- 64 17 6 6 6 7 
65- 68 18 6 6 ti 7 
69- 72 19 6 7 7 7 
73- 75 20 6 7 7 7 
76- 79 21 6 7 7 7 
80- 83 22 7 7 7 7 
84- 86 23 7 7 7 7 
87- 90 24 7 7 8 8 
91- 94 25 7 7 8 8 
95- 98 26 7 8 8 8 
99- 101 27 7 8 8 8 
102- 105 28 7 8 8 8 
106- 109 29 7 8 8 8 
110- 112 30 7 8 8 8 
113- 116 31 8 8 8 8 
117- 120 32 8 8 9 9 
121- 124 33 8 8 9 9 
125- 127 34 8 9 9 9 
128- 131 35 8 9 9 9 
132- 135 36 8 9 9 9 
136- 138 37 8 9 9 9 
139- 142 38 8 9 9 9 
143- 146 39 8 9 9 9 
147- 149 40 oe 9 9 10 
150- 153 41 9 9 10 10 
154- 157 42 9 9 10 10 
9 
9 
9 


SIMPLIFIED ITEM-ANALYSIS 449 
TABLE 46 (Continued) 


(Wi — Wx) at or above Which an Item Can Ве 
Considered Su ficiently Discriminating 


Number in Number of Options 


Total Number Low or High 


of Persons Grow 
B p (0.27N) 
Tested (N) (Ni = Мн = n) | (True-False or 


169- 172 
173- 175 
176- 179 
180- 183 
184- 187 
188- 190 
191- 194 
195- 198 
199- 201 
202- 205 
206- 209 
210- 212 
213- 216 
217- 220 
221- 224 
225- 227 
228- 231 
232- 235 
236- 238 
239- 242 
243- 246 
247- 249 
250- 253 
254— 257 
258- 261 
262- 264 
265- 268 
269- 272 
273- 275 
276- 279 
280- 283 
284- 287 
288- 290 
291- 294 
295- 298 
299- 301 
302- 305 
306- 309 
310- 312 
313- 316 
317- 320 
321- 324 
325- 327 
328- 331 
332- 335 90 


450 APPENDICES 
TABLE 46 (Continued)

(W_L − W_H) at or above Which an Item Can Be Considered Sufficiently Discriminating

Total Number of Persons Tested (N) | Number in Low or High Group (0.27N) (N_L = N_H = n) | Number of Options: 2 (True-False or Two-Option Multiple Choice) | 3 | 4 | 5
336- 338 91 12 13 14 14 
339- 342 92 12 13 14 14 
343- 346 93 13 13 14 14 
347- 349 94 13 14 14 14 
350- 353 95 13 14 14 14 
354— 357 96 13 14 14 14 
358- 361 97 13 14 14 14 
362- 364 98 13 14 14 14 
365- 368 99 13 14 14 15 
369- 372 100 13 14 14 15 
406- 409 110 14 15 15 15 
443- 446 120 14 15 16 16 
480 —483 130 15 16 16 16 
517- 520 140 15 16 17 17 
554- 557 150 16 17 17 18 
591- 594 160 16 18 18 18 
628- 631 170 17 18 19 19 
665- 668 180 17 19 19 19 
702- 705 190 18 19 20 20 
739- 742 200 18 19 20 20 
832- 835 225 19 21 21 21 
925- 927 250 20 22 22 23 
1017-1020 275 21 23 23 24 
1110-1112 300 22 24 24 25 
1480-1483 400 25 27 28 28 
1850-1853 500 28 30 31 31 
3702-3705 43 44 44 


* Values for this 23 per cent level-of-significance table, which is based upon Stanley’s 
2 per cent-level table (see American Psychologist, 6: 369, July, 1951), were computed 
by Miss Ellen V. Piers. 


five-option items is 12.⁴ Purely by accident, this makes the same number of items,


⁴ This is 12/66 = 18 per cent of the maximum possible W_L − W_H for an n of 66. If n
were 200, a difference of only 20 (just 10 per cent of the maximum) would be needed
for a five-option item to be significantly discriminating. With an n of 1,000, only 4.4
per cent of the maximum possible difference is required for significance. Therefore,
when n is rather small, as it usually is for teacher-made tests, a considerable number
of items will be branded improperly as nondiscriminating. As an extreme example, take


SIMPLIFIED ITEM-ANALYSIS 451 


25, eligible for editing as were obtained by the 25 per cent rule of thumb.

If a calculating machine is available, $\frac{100\,O}{2n(O-1)}$ may be locked in its keyboard
and the percentage difficulty value for each item computed very rapidly by using
W_L + W_H as the multiplier. By hand, the multiplication is likely to be tedious and
inaccurate; therefore, in the absence of a machine, it is recommended that just three
points on the difficulty scale be computed in terms of W_L + W_H: 16 per cent for
the boundary line of a very easy item, 50 per cent for the middle-difficulty item,
and 84 per cent for the boundary line of a very hard item. The formulas for
obtaining these three "critical" W_L + W_H points are shown in Table 47.


TABLE 47

FORMULAS FOR FINDING (W_L + W_H) VALUES AT THREE DIFFICULTY LEVELS

Percentage of Testees Who Do Not "Know" the Correct Answer to the Item | Number of Options Each Item Has: 2 | 3 | 4 | 5

16   0.160n*   0.213n   0.240n   0.256n
50   0.500n    0.667n   0.750n   0.800n
84   0.840n    1.120n   1.260n   1.344n

* n = number of examinees in the low or the high group = 27 per cent of the total
number tested, rounded off to the nearest whole number.


The three W_L + W_H figures may be used to determine roughly whether the item
is quite easy, of moderate difficulty, or quite hard. For purposes of arranging the
revised items in order of difficulty, the various W_L + W_H values, adjusted
subjectively for changes in difficulty brought about by editing, will do.

As shown in the first footnote to Table 42 on page 438, $\frac{100\,O}{2n(O-1)} = 0.947$
when O = 5 and n = 66. The last column of that table was obtained by multiplying
each W_L + W_H by 0.947.

For five options and 66 testees in each group (n = 66), the 16 per cent difficulty
point of Table 42 occurs when W_L + W_H = 0.256n = (0.256)(66) = 17. Similarly,
the 50 per cent point is W_L + W_H = (0.800)(66) = 53. The 84 per cent point
occurs when W_L + W_H = (1.344)(66) = 89. Therefore, in Table 42, items will be
considered easy if W_L + W_H is 17 or less, of moderate difficulty if W_L + W_H is
from 18 to 88, and hard if W_L + W_H equals 89 or more. According to this method,
17 of the 100 items in Table 42 are easy, while only 5 are hard. Thus the test as a
whole is relatively easy.
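
The critical points of Table 47 follow from solving the difficulty formula for W_L + W_H, and a short sketch can reproduce both the table coefficients and the three-way sorting just described. The function names `critical_sum` and `classify` are assumptions for this illustration only.

```python
def critical_sum(percent, n, options):
    """W_L + W_H at which the chance-corrected difficulty equals `percent`.

    Obtained by solving percent = 100 * O / (2n(O - 1)) * (W_L + W_H).
    """
    return percent / 100 * 2 * n * (options - 1) / options


def classify(w_sum, n, options):
    """Sort an item into easy / moderate / hard using the 16 and 84 per cent points."""
    easy_limit = round(critical_sum(16, n, options))
    hard_limit = round(critical_sum(84, n, options))
    if w_sum <= easy_limit:
        return "easy"
    if w_sum >= hard_limit:
        return "hard"
    return "moderate"


# Five options, n = 66, as in Table 42.
print([round(critical_sum(p, 66, 5)) for p in (16, 50, 84)])
print(classify(71, 66, 5), classify(5, 66, 5), classify(109, 66, 5))
```

Setting n = 1 in `critical_sum` returns the coefficient of n printed in Table 47 for the chosen number of options.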


31 examinees (N = 31) for whom the number in the high or the low group (n) is 8. By
sheer chance marking 4 out of the 8 persons in the low group would be expected to give
the keyed answer to a true-false item, if none of them omitted it. Thus the item could
be deemed discriminating according to Table 46, where a difference of at least 4 is
required, only if every person in the high group marked it correctly. Table 46 has some
value as a means of determining for how many items in a test complete option
information should be tabulated and a thorough scrutiny made, but the basic rule illustrated
in Table 42 on pages 438-440 is to edit as many items as possible, beginning with the
least discriminating ones.


452 APPENDICES 


F. Obtaining the Mean and the Standard Deviation 


By using the sum of the W_L + W_H column of the item-analysis table, it is rather
easy to estimate the average "wrongs" score of the N testees. To secure this mean
wrongs score, not corrected for chance, simply add the W_L + W_H column and
divide this sum by 2n:

$$M_W = \frac{\sum (W_L + W_H)}{2n}$$

To correct the mean wrongs score for chance, use the following formula:

$$M_{WC} = \frac{O \sum (W_L + W_H)}{2n(O - 1)}$$

where O is the number of options each item has.

The standard deviation of the wrongs scores is the same as the standard deviation
of the rights scores when omits are counted as being wrong, and this statistic, not
corrected for chance, is easily estimated by means of the formula

$$s_W = \frac{\sum (W_L - W_H)}{2.45n}$$

Note the minus sign in this formula. To secure a standard deviation corrected for
chance, multiply the above formula by O/(O − 1):

$$s_{WC} = \frac{O \sum (W_L - W_H)}{2.45n(O - 1)}$$


For illustrations, turn back to Table 42, page 438. There the W_L + W_H column
sums to 4,088, and the W_L − W_H column total is 1,762. 4,088 divided by 2n
equals 4,088/132, or 31.0, the mean number of incorrect responses, uncorrected for
chance, for the 100 items.

(5 × 4,088) ÷ (2 × 66 × 4) = 38.7, the mean wrongs corrected for chance. This
figure agrees rather well with the mean of the right-hand (difficulty) column of
Table 42, which is 39.0. Thus on the average the correct answer to the 100 items
was "known" by about 61 per cent of the examinees and not "known" by about
39 per cent.

The standard deviation not corrected for chance is 1,762 ÷ (2.45 × 66) = 10.9,
exactly the same as when computed from all 243 total scores. Corrected for chance
it becomes (5)(1,762) ÷ (2.45)(66)(4) = 13.6.

These approximation formulas based upon only low and high groups are accurate 
enough for use in most school situations, particularly when the number of testees 
is fairly large, say, 100 or more. 
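
The four estimates in this section reduce to two short functions, sketched below with the column totals quoted from Table 42. The function and parameter names are assumptions for the illustration; passing `options` applies the correction for chance by the factor O/(O − 1).

```python
def mean_wrongs(sum_of_sums, n, options=None):
    """Mean wrongs score from the total of the W_L + W_H column.

    If `options` is given, the result is corrected for chance.
    """
    mean = sum_of_sums / (2 * n)
    if options is not None:
        mean *= options / (options - 1)
    return mean


def sd_wrongs(sum_of_differences, n, options=None):
    """Standard deviation estimated from the total of the W_L - W_H column."""
    sd = sum_of_differences / (2.45 * n)
    if options is not None:
        sd *= options / (options - 1)
    return sd


# Totals quoted in the text for Table 42: 4,088 and 1,762, with n = 66.
print(round(mean_wrongs(4088, 66), 1), round(mean_wrongs(4088, 66, 5), 1))
print(round(sd_wrongs(1762, 66), 1), round(sd_wrongs(1762, 66, 5), 1))
```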


G. A Simplified Procedure for Obtaining a 
Reliability Coefficient 


Unfortunately, there is no fairly precise method for securing a single-form
reliability coefficient ("coefficient of equivalence") without considerable computation.
Stanley has devised two shorter procedures yielding results closely approximating
those of the conventional methods. His simplified split-half technique has been
reported elsewhere⁵ in considerable detail and will not be repeated here. Instead,

⁵ Julian C. Stanley, "A Simplified Method for Estimating the Split-Half Reliability
Coefficient of a Test," Harvard Educational Review, 21: 221-224, Fall, 1951.


SIMPLIFIED ITEM-ANALYSIS 453 


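a Kuder-Richardson Formula 20 (KR₂₀) r will be obtained from just the low and
high group figures in Table 42, page 438, as follows, where k is the number of items:

$$
\begin{aligned}
r_{KR_{20}} &= \frac{k}{k-1}\left[1 - \frac{2n\sum (W_L + W_H) - \sum (W_L + W_H)^2}{0.667\left[\sum (W_L - W_H)\right]^2}\right] \\
&= \frac{100}{99}\left[1 - \frac{2(66)(4088) - 227{,}630}{0.667(1762)^2}\right] \\
&= \frac{100}{99}\left(1 - \frac{311{,}986}{0.667(3{,}104{,}644)}\right) = \frac{100}{99}(0.849) \\
&= .86,
\end{aligned}
$$

which is the same as the value secured by using the regular KR₂₀ formula with all
243 cases. In general, the abbreviated procedure yields slightly lower r's, but for
most practical purposes this negative bias will be negligible.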

The only part of the above formula not used in computing either the mean or the
standard deviation is Σ(W_L + W_H)². To get it, square each of the 100 (W_L + W_H)
values in Table 42 and then sum them: (71)² + (3)² + (77)² + ··· + (57)² =
227,630. By hand this is a laborious process indeed, but on an electric calculating
machine (Friden, Marchant, Monroe) it is simple to secure both Σ(W_L + W_H) and
Σ(W_L + W_H)² in a single set of operations. In most medium-sized and large school
systems there is a machine of this sort and someone who knows how to use it.
Getting all three needed values for the formula and checking them should take a
skilled operator not more than half an hour, if an item analysis table similar to
Table 42 has already been prepared.

This method does not involve splitting the test into halves, a tedious undertaking
at best when a test-scoring machine is not available. The split-half coefficient of
equivalence based upon all 243 testees is .87; it is also .87 when determined from
only the high and low groups. KR₂₀ coefficients tend to be a little smaller than
split-half ones,⁶ but again the discrepancy is usually of no practical consequence
to the teacher.

⁶ Lee J. Cronbach, "Coefficient Alpha and the Internal Structure of Tests,"
Psychometrika, 16: 297-334, September, 1951.
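
The abbreviated KR₂₀ computation of this section can be sketched from the three totals the text quotes for Table 42. The function name and parameter names are assumptions for the illustration, and the constant 0.667 is taken here to be the rounded value of 4/(2.45)², which is what results when the item variances and the estimated standard deviation are both written in terms of the high and low group counts.

```python
def kr20_from_groups(k, n, sum_sums, sum_squared_sums, sum_differences):
    """Abbreviated Kuder-Richardson Formula 20 from high-low group totals.

    k: number of items
    n: number of examinees in each group
    sum_sums: total of the W_L + W_H column
    sum_squared_sums: total of the squared W_L + W_H values
    sum_differences: total of the W_L - W_H column
    """
    numerator = 2 * n * sum_sums - sum_squared_sums
    denominator = 0.667 * sum_differences ** 2  # 0.667 is about 4 / 2.45**2
    return k / (k - 1) * (1 - numerator / denominator)


# Totals quoted in the text: 4,088; 227,630; and 1,762, with k = 100 and n = 66.
r = kr20_from_groups(100, 66, 4088, 227630, 1762)
print(round(r, 2))
```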


APPENDIX 


C 


Scoring Rearrangement (Ranking) 
Test Items 


In many subjects, especially history, where one of the objectives is to acquire a 
sense of sequence or chronology, rearrangement questions may be a better testing 
device than other item forms. Their construction calls for considerable skill in 
putting together material so that the ranking task will not be too demanding at 
some points and too easy at others. Comparatively little has been written about 
the preparation of this type of item, though much has been said about scoring it, 
but in all likelihood there are potential uses for rearrangement items in many 
academic fields. Steps in solving a mathematical problem, sequences of equations 
in chemistry, and relative quality of various literary selections are possible illustra- 
tions. An essential condition is that “experts” can agree reasonably well as to how 
the things should be ranked, since otherwise it will not be possible to devise a 
satisfactory key. For this reason, independent keying by at least two teachers and
reconciliation of differences before the items are used is desirable, but such
precautions should by no means be confined to rearrangement items.

A chronology test item was presented in Table 19 on page 95 and scored for two 
examinees by means of the Spearman rank-difference coefficient of correlation, rho. 
A little reflection will make it obvious that the magnitude of each discrepancy 
between the student’s response and the keyed rank, not just the fact that they 
differ, is taken into account. From the historian’s point of view, it is far worse to 
think that the French Revolution occurred before the Roman Empire fell than to 
place it just prior to the destruction of the Spanish Armada. Similarly, it is 
"wronger" to rate cork heavier than white oak than to confuse the densities of 
cork and balsa. 

The apparent difficulty of scoring rearrangement items has discouraged most 
testers from using them. Actually, when a suitable table is available, the task is 
not formidable. One way to score each item is to compute the rho (p) between the 
student’s responses and the teacher’s key, as shown on page 95, and to multiply 
this rho by N, the number of things ranked. Thus the testee whose rankings on a
certain six-option item agree completely with the key secures a rho of +1 and a
score of 1 × 6 = 6 for the item. If his rho on another such item is 0, he gets
0 × 6 = 0 points credit for that item. By being completely misinformed and

454 


RANKING TEST ITEMS 455 


securing a rho of −1, he could earn (−1) × 6 = −6 points. The rearrangement
item is just about the only type that takes misinformation explicitly into account.

But using the formula, score = (number of things ranked) X rho, the various 
values of ρ are treated as if they constituted an equal-unit scale. As shown in
Figure 4 on page $9, differences at the high end of the r scale represent greater 
changes in relationship than do differences of like magnitude between low r's. In
an attempt to compensate for these inequalities, one of the writers (JCS) prepared
Table 48 by means of the formula S = (number of things ranked) × rho squared
= Nρ².
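
The two scoring rules can be compared with a minimal sketch. It uses the standard Spearman rank-difference formula, ρ = 1 − 6ΣD²/[N(N² − 1)]. The function names are hypothetical, and one point is an assumption rather than something the text states outright: the sketch keeps the sign of ρ when squaring, which is consistent with the negative scores that appear in Table 48 for large ΣD² values.

```python
def rho_from_sum_d2(sum_d2, n_ranked):
    """Spearman rank-difference coefficient from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n_ranked * (n_ranked ** 2 - 1))


def score_linear(sum_d2, n_ranked):
    """Score = N x rho, rounded to a whole number of points."""
    return round(n_ranked * rho_from_sum_d2(sum_d2, n_ranked))


def score_squared(sum_d2, n_ranked):
    """Score = N x rho squared, keeping the sign of rho (assumed convention)."""
    rho = rho_from_sum_d2(sum_d2, n_ranked)
    sign = 1 if rho >= 0 else -1
    return sign * round(n_ranked * rho * rho)


# The two examinees discussed with Table 48: six events ranked,
# with sums of squared differences of 40 and 6.
for sum_d2 in (40, 6):
    print(sum_d2, score_linear(sum_d2, 6), score_squared(sum_d2, 6))
```

Because ρ² is never larger in absolute value than ρ, the squared rule pulls middling performances toward zero while leaving perfect agreement at the full N points.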

TABLE 48 


ΣD² TABLE FOR SCORING REARRANGEMENT ITEMS:
SCORE = (NUMBER OF THINGS RANKED) × (RHO)²


(Look up ΣD² in Appropriate Column)


Score Score 
7 0- 2 7 
6 4- 6 6 
5 8- 10 5 
4 12- 16 4 
3 18- 22 3 
2 24- 30 2 
1 32- 40 1 
0 42- 70 0 

17, 72- 80 -1 
59 82- 88 -2 
F -3 
p 2 
-5 


To score a rearrangement item, simply obtain the ΣD² "in your head" by comparing
the student's responses with a conveniently arranged key. This ΣD² is used
to enter the column of the table for the proper number of things ranked, from which
the score is read directly at either the left or the right. All ΣD² values for N = 2
through N = 7 are contained in Table 48. Tables for N's higher than 7 can be
prepared easily, but it does not seem wise to construct these and thereby
encourage testers to devise more complex and less homogeneous rearrangement
items than are actually desirable.
Refer to Table 19 on page 95 for two illustrations of how Table 48 is used.
There Richard Roe's ΣD² is 40 for 6 events ranked. Looking in the "6" column of
Table 48 we find that 40 lies within the 26-44 ΣD² range, for which the score is 0.
Richard gets no points at all for his inaccurate responses to this item. Had we
employed the formula S = Nρ, he would have received 6(−.14) = −.84 point.

The other testee for whom ranks are listed in Table 19 had a ΣD² of 6, which
lies in an interval of Table 48 that merits a score of 4. Had he been
scored by means of the Nρ formula, he would have obtained .83 × 6 ≈ 5 points.
The Nρ² scoring method of Table 48 usually yields scores closer to 0 than
the Nρ procedure. It seems somewhat more defensible on statistical grounds.

APPENDIX 


D 


The Computation of Square Roots 


When determining a standard deviation or an r, one must find the square root
of a number. This can be done easily from a table of square roots, such as Barlow's 
Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals (London: 
E. & F. N. Spon, Ltd.). Quite a few statistics texts and other books have shorter 
tables. If only an occasional square root is needed, it can probably be secured more 
easily "by hand" than by searching for a table. 

In Chapter 15 of Mathematics Essential for Elementary Statistics (New York: 
Henry Holt and Company, 1951), Helen M. Walker devotes 16 pages to explaining 
what square roots mean and how they are secured. The reader who is thoroughly 
in the dark concerning this topic will want to consult that reference. For others
who need merely a little reviewing of material previously learned, the following 
brief explanation is offered. 

1. First take a small three-digit whole number that is a perfect square; 144 is a 
good illustration. The square root of 144 is a number which, when multiplied by 
itself, equals 144. To extract the square root of 144, follow these seven steps: 


Completed work, after step (g):

           1  2.
        √ 1^44.
         −1
    22 |  0 44
          −44
            0


(a) Provide the 144 with a square root sign and two decimal points, one above the
other.
(b) Begin with the decimal point following 144 and move to the left two digits at
a time, putting a caret at each stopping place. With 144 only one move and
therefore one caret is needed.
(c) Look at the number to the left of the caret. What number multiplied by itself
is as nearly equal to 1 as possible but not greater than 1? 1, of course, so write
this 1 above the 1 and below it. Subtract.
(d) Draw down the next two numbers.
(e) Double the top 1 and write it to the left of 0 44.
(f) Now, how many times does 2 go into 0 4? 2, so write 2 in the answer space above
the right-hand 4 and also to the right of the 2.
(g) Multiply the 22 by 2, write this product (44) below the other 44, and subtract.
Therefore, the square root of 144 is exactly 12, since 12 × 12 = 144.
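
The seven steps above amount to a digit-pair algorithm, which can be sketched in Python for whole numbers. This is an illustrative reading of the procedure, not code from the text; the function name is hypothetical, and the trial digit is found by counting down from 9 rather than by the estimate-and-adjust step a person would use.

```python
def longhand_sqrt(number):
    """Return (root, remainder) for a non-negative whole number,
    working two digits at a time as in the longhand method."""
    digits = str(number)
    if len(digits) % 2:                 # pad so the digits split into pairs
        digits = "0" + digits
    root = 0
    remainder = 0
    for i in range(0, len(digits), 2):
        # "Draw down the next two numbers."
        remainder = remainder * 100 + int(digits[i:i + 2])
        # The trial divisor is the doubled root with the new digit appended.
        digit = 9
        while (20 * root + digit) * digit > remainder:
            digit -= 1
        remainder -= (20 * root + digit) * digit
        root = root * 10 + digit
    return root, remainder


root, remainder = longhand_sqrt(144)
print(root, remainder)
```

For 144 the second pass uses the divisor 22 and subtracts 22 × 2 = 44, leaving a root of 12 and no remainder, as in step (g).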
2. Take a large decimal fraction, 9342.156, and find its square root to the nearest
two decimal places.


Completed work, after step (f):

               9  6. 6  5  4
            √ 93^42.15^60^00
             −81
      186  |   12 42
              −11 16
     1926  |    1 26 15
               −1 15 56
    19325  |      10 59 60
                  −9 66 25
   193304  |         93 35 00
                    −77 32 16
                     16 02 84


(a) First write down the number with the square root sign, carets, and a decimal 
point in the answer place. (Notice that this decimal point is always exactly 
above the decimal point in the number.) Begin at the decimal point in the
number and count in both directions by two's, putting a caret between each 
pair. Zeros are added to the right of the decimal point beyond the last figure 
in order to have the two numbers to draw down each time. In order to carry 
out the square root to the nearest two decimal places (rounded off from three 
places), it is necessary to have six figures to the right of the decimal point. 

(b) What number multiplied by itself is as nearly equal to 93 as possible, without 
exceeding it? 10 X 10 = 100, which is too much. 9 X 9 = 81, so use 9 as the 
first number in the square root. Multiply it by itself and subtract the 81 from 93. 

(c) Draw down the next two numbers (42), double 9, and write 18 to the left of 
12 42. 

(d) Approximately how many times will 18 go into 124? Not quite 7, for 18 X 7 
= 126. Try the next lower number, 6. Write it in the answer space above the 
2 and also to the right of the 18. Multiply 6 by 186 and subtract this product, 
11 16, from 12 42.

(e) Draw down the next two numbers and double the 96. 19 goes into 126 about 
6 times. Repeat the above process.

(f) Double 966, write 1932 in the proper place, and complete the remaining steps. 

The square root of 9342.156 is 96.654+, which when rounded off to the nearest
two decimal places becomes 96.65. Where test scores are concerned, only one
decimal place is usually needed for the standard deviation. Also, in the denominator
of the r formula, on page 92, extraction of the square roots to the nearest three
figures is sufficient for most purposes.

To check the computation of a square root, multiply the value obtained by itself
and add to this product the remainder. For example, (96.654 × 96.654) + .160284
= 9342.156000, which agrees exactly with the figure underneath the square root
sign in Step (a) on page 457.
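
The check described above can be reproduced with exact integer arithmetic. The sketch below is an assumption-laden illustration: it drops the decimal point by scaling 9342.156 to six decimal digits, and it uses Python's built-in `math.isqrt` in place of the longhand extraction.

```python
from math import isqrt

# 9342.156000 with the decimal point dropped (six digits after the point).
scaled = 9342156000

root = isqrt(scaled)                 # whole-number root; three decimals of the true root
remainder = scaled - root * root     # what is left under the last subtraction line

# The check: root squared plus remainder must give back the original number.
check_passes = root * root + remainder == scaled

rounded = round(root / 1000, 2)      # root to the nearest two decimal places
print(root, remainder, check_passes, rounded)
```

Read with the decimal point restored, the root is 96.654 and the remainder .160284, the same figures used in the check in the text.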



APPENDIX 


E 


Answers to Questions in Appendix A 


С. 
р. 


1. "The sum of f" is the total frequency, N.

2. For example, 310-319 means 310, 311, 312, 313, 314, 315, 316, 317, 318,
and 319, a total of 10 numbers. Likewise, the difference between the
upper and lower "real" class limits = (319 + 0.5) − (310 − 0.5) =
319.5 − 309.5 = 10.

3. See above.

4. (260 + 269)/2 = 529/2 = 264.5. Likewise, 259.5 + (10/2) = 269.5 −
(10/2) = 264.5.

5. The assumed mean lies at the midpoint of the class, 260-269, whose d is 0.
Thus M = (260 + 269)/2 + [10 × (−15)]/50 = 264.5 + (−150)/50 =
264.5 − 3 = 261.5.

6. 50/2 = 25. 259.5 + [(25 − 24)/12](10) = 259.5 + (10/12) = 260.3. Similarly,
counting from the top down as a check, 269.5 − [(25 − 14)/12](10) =
269.5 − (110/12) = 269.5 − 9.2 = 260.3.

7. The mode is the midpoint of the class having the greatest f. 12 is the
largest figure in the f column, and the midpoint of its class is (260 + 269)/2
= 264.5.

8. 1/4 of 50 is 12.5. Count up 12.5 frequencies. 239.5 + [(12.5 − 5)/8](10) =
239.5 + (75/8) = 239.5 + 9.4 = 248.9. Or, 3/4 of 50 = 150/4 = 37.5. Counting
down, 249.5 − [(37.5 − 37)/8](10) = 249.5 − (5/8) = 249.5 − 0.6 = 248.9.

9. The 75th percentile is 3/4 of the way from the bottom of the distribution and
1/4 of the way from the top. 1/4 of 50 is 12.5. Count down 12.5 frequencies.
279.5 − [(12.5 − 8)/6](10) = 279.5 − (45/6) = 279.5 − 7.5 = 272.0. To
check, count up 37.5 frequencies. 269.5 + [(37.5 − 36)/6](10) = 269.5 +
(15/6) = 269.5 + 2.5 = 272.0.
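
The interpolation used in these percentile answers follows one pattern, sketched here in Python. The function name and argument names are hypothetical, and the counts below each class (24, 5, and 36) are read off the worked arithmetic, for example 10/12 and 75/8, rather than from the frequency table itself, which is in Appendix A.

```python
def percentile_point(lower_limit, count_below, class_freq, width, target_count):
    """Interpolate within a class: start at its lower real limit and move up
    by the fraction of the class frequency still needed."""
    return lower_limit + (target_count - count_below) / class_freq * width


n = 50
median = percentile_point(259.5, 24, 12, 10, n / 2)       # 50th percentile
q1 = percentile_point(239.5, 5, 8, 10, n / 4)             # 25th percentile
q3 = percentile_point(269.5, 36, 6, 10, 3 * n / 4)        # 75th percentile
print(round(median, 1), round(q1, 1), round(q3, 1))
```

Rounded to one decimal place these agree with the 260.3, 248.9, and 272.0 obtained by hand.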


10. B. Q3 is the 75th percentile, found to be 272.0 in Question 9, above, and Q1
is the 25th percentile, 248.9 in Question 8. (272.0 − 248.9)/2 = (23.1)/2
= 11.55 ≈ 12.

11. 1/10 of 50 is 5. 239.5 + (0)(10) = 239.5. Counting down, 9/10 of 50 is 45.
239.5 − (0)(10) = 239.5.

12. 50 − 0.9(50) = 5. Count down 5, or count up 45. Obviously, it is easier
to count down 5. Counting up is useful as a check, though. 299.5 −
[(5 − 3)/4](10) = 299.5 − (20/4) = 294.5. Thus the 90th percentile is the
midpoint of the 290-299 class; it lies exactly halfway within that class,
5 units below the upper real limit and 5 units above the lower real limit.
Check by counting up: 289.5 + [(45 − 43)/4](10) = 289.5 + 5 = 294.5.

13. Use the answers to Questions 12 and 11 in solving this. 0.4(294.5 − 239.5)
= 0.4(55) = 22.0.

14. 10 × √[50(239) − (−15)²]/50 = √(11,950 − 225)/5 = √11,725/5
= 108/5 = 21.6 ≈ 22.

15. 19.5 − 9.5 = 10.

16. 99.5 + ½(109.5 − 99.5) = 99.5 + ½(10) = 104.5. To check: 109.5 −
½(109.5 − 99.5) = 109.5 − 5 = 104.5.


17. There are two essentially different reasons for grouping scores: computational
(to reduce labor) and graphical (to emphasize important features of
the data). One of the best discussions of the latter aspect is contained in
Truman L. Kelley's Fundamentals of Statistics, Chapter IV, "Graphic
Methods." Cambridge, Massachusetts: Harvard University Press, 1947.

18. It is necessary to know the range before performing the operations set forth
in Options B, C, D, and E.

19. For instance, into which of the two classes, 44-48 and 40-44, would you
put a score of 44?

20. The 60th percentile is defined as the point in a distribution below which lie
60 per cent of the scores and above which lie 40 per cent of the scores.

21. When arranged in numerical order, these scores are: 4, 4, 5, 6, 7. The
middle score (midscore) is 5.

22. In numerical order these scores are 44, 46, 46, 46, 63, 68, 68; their midscore
is 46, which has three numbers on each side of it. A frequency distribution
of these scores with an interval of 1 is as follows:

    Score    f
     68      2
     63      1
     46      3
     44      1

The median of this distribution is found by counting half the way up or
down the frequency column. The total frequency is 7, and half of 7 is 3.5.
Counting up through the 43.5-44.5 class uses only 1 frequency, but counting
through the next (45.5-46.5) class involves 1 + 3 = 4 frequencies,
more than the 3.5 required to locate the median. Thus the median is
45.5 + [(3.5 − 1)/3](1) = 45.5 + (2.5)/3 = 45.5 + 0.83 = 46.33. To check:
46.5 − [(3.5 − 3)/3](1) = 46.5 − (0.5/3) = 46.5 − 0.17 = 46.33. Therefore,
the discrepancy between the median and the midscore in this distribution
is 46.33 − 46 = 0.33, which illustrates the fact that the midscore
and the median of a distribution may have different values. Usually the
difference is slight, however.
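
The contrast between midscore and median in this answer can be reproduced with a short Python sketch. The function names are hypothetical, and the interpolated median is written out by hand so that each step of the counting is visible; it assumes each distinct score is a class of width 1 centred on that score.

```python
from collections import Counter


def midscore(scores):
    """Middle score of the ordered list (average of the two middle scores if even)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def interpolated_median(scores, width=1):
    """Median found by counting half the frequencies up the distribution."""
    counts = Counter(scores)
    half = len(scores) / 2
    below = 0
    for value in sorted(counts):
        freq = counts[value]
        if below + freq >= half:
            lower_real_limit = value - width / 2
            return lower_real_limit + (half - below) / freq * width
        below += freq


scores = [44, 46, 46, 46, 63, 68, 68]
mid = midscore(scores)
med = interpolated_median(scores)
print(mid, round(med, 2))
```

For these seven scores the midscore is 46 and the interpolated median is about 46.33, so the two differ by 0.33 as stated.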


23. (4 + 5 + 7 + 6 + 4)/5 = 26/5 = 5.2.

24. Only two of the five options (C and E) contain measures of central tendency;
the standard deviation, quartile deviation, and range are measures of variability.
Since the arithmetic mean is a function of every score in the distribution,
its value would reflect "the undue influence of a few extreme
salaries." Whether the highest paid worker made $5000 or $50,000 is wholly
inconsequential so far as the size of the median is concerned.


] 5 — (—0.5 7х0 3X1 
m- mw 008 ggg BS COS 8х0 xIx t 


7+3 


= 15+ - 15412 = 27. 


26. It shows the range of the middle 80 per cent of the scores in the distribution.

27. Q3 is the 75th percentile, and the median is the 50th percentile. Twenty-five
per cent of the scores lie between these two points, since 75 − 50 = 25.


    Score    f     d     fd    fd²
      4      3     2      6    12
      3      0     1      0     0
      2      4     0      0     0
      1      0    −1      0     0
      0      3    −2     −6    12
           N = 10     Σfd = 0    Σfd² = 24

SD = (i/N)√[NΣfd² − (Σfd)²] = (1/10)√[10(24) − 0²] = √240/10 = 1.55




The size of the interval of each class is 1, the classes 2.5-3.5 and 0.5-1.5 
having frequencies of 0. For convenience, the assumed mean, M’, was set 
at 2, the midpoint of the 1.5-2.5 class, which is the center of the distribution. 
It may be put anywhere else, of course, without altering the answer. 
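
The claim that the assumed mean may be placed anywhere can be checked numerically. The sketch below is a hedged illustration using the score and f columns of the table above; the function name is hypothetical, and the formula is the coded-deviation form SD = (i/N)√[NΣfd² − (Σfd)²] that appears in the worked answers.

```python
from math import sqrt


def mean_and_sd(freqs, assumed_mean, width=1):
    """Mean and SD by the short (assumed-mean) method.
    freqs maps each class midpoint to its frequency."""
    n = sum(freqs.values())
    sum_fd = sum(f * (x - assumed_mean) / width for x, f in freqs.items())
    sum_fd2 = sum(f * ((x - assumed_mean) / width) ** 2 for x, f in freqs.items())
    mean = assumed_mean + width * sum_fd / n
    sd = width * sqrt(n * sum_fd2 - sum_fd ** 2) / n
    return mean, sd


freqs = {4: 3, 3: 0, 2: 4, 1: 0, 0: 3}       # score: frequency, from the table
mean_a, sd_a = mean_and_sd(freqs, assumed_mean=2)
mean_b, sd_b = mean_and_sd(freqs, assumed_mean=0)
print(mean_a, round(sd_a, 2), mean_b, round(sd_b, 2))
```

Both placements of the assumed mean give a mean of 2 and a standard deviation of about 1.55.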


D. Q = 75th percentile ; 25th percentile 
—3 | | P: G - *) | 
_[ss- n )o — | 1.5 + НЕ (1) 
2 


2 2 2 


32. C. 2.5 + (0.75)(1) = 2.5 + 0.75 = 3.25.

Check: 3.5 − (0.25)(1) = 3.5 − 0.25 = 3.25.
B У к 
"um SD 017909 Priv Rages 


Therefore, this individual is half a standard deviation below the mean of 
the group with which he was tested. 


34. E. Three tied scores of 95 occur. If there were no ties, these three places would
have ranks of 4, 5, and 6. Since one score of 95 is as good as another, we
assign the average of 4, 5, and 6 (which is 5) to each of the three scores.
Note that 4 + 5 + 6 = 15, the same as the sum of the new ranks:
5 + 5 + 5 = 15. Whether or not ties occur, the sum of a certain number
(N) of consecutive ranks beginning with 1 will always be [N(N + 1)]/2.
If there are 9 ranks, as in this question, their sum will be (9 × 10)/2 = 45.

35. B. As noted above with reference to Question 34, the method of assigning to
tied scores the average rank that would have occurred without ties keeps
the sum of untied and tied sets of ranks of the same length identical. Since
the sum is unchanged, the mean, which is the sum divided by N, the number
of ranks, is also unchanged. It will always be (N + 1)/2.
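
The averaging of tied ranks can be illustrated with a short Python sketch. The nine scores below are hypothetical apart from the three tied 95s in fourth, fifth, and sixth place, which follow the description in the answer; the function name is also hypothetical.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) down, giving tied scores
    the average of the ranks they would otherwise occupy."""
    ordered = sorted(scores, reverse=True)
    places = {}
    for place, score in enumerate(ordered, start=1):
        places.setdefault(score, []).append(place)
    return [sum(places[s]) / len(places[s]) for s in scores]


scores = [99, 98, 97, 95, 95, 95, 90, 88, 85]   # hypothetical set of nine scores
ranks = average_ranks(scores)
n = len(ranks)
rank_sum = sum(ranks)
rank_mean = rank_sum / n
print(ranks, rank_sum, rank_mean)
```

Each 95 receives rank 5, the ranks still sum to 9 × 10/2 = 45, and their mean is (9 + 1)/2 = 5.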


36. B. See Chapter 3.

37. A.

38. C. An r of 0 has the least possible predictive value. The closer to 0 r gets,
regardless of sign, the poorer prediction becomes.

39. E. The r discussed in Chapter 3 simply cannot be greater than +1.00 or
less than −1.00 except when computational errors are made.

40. A. Each tally mark represents a pair of scores. There are N pairs of scores in
all. The number of cells in a 12 × 12 scatter diagram is 144, but some of
these will probably be blank, while others will have more than one tally.
See Table 18, page 92, which has 15 × 14 = 210 cells, 33 of which contain
the N = 43 tallies. 210 − 33 = 177 of the cells are empty.

41. B. The arithmetic mean is a measure of central tendency; the standard deviation
is a measure of variability.

42. C. Q3 is the 75th percentile; the median (Q2) is the 50th percentile.




43. Both the mean and the standard deviation are based upon all scores in the
distribution; the median and Q are both percentile measures. Also, the mean
is used with the SD, while the median is used with Q. The analogy is: A
certain kind of measure of central tendency is to a similar sort of measure
of variability as another kind of measure of central tendency is to a similar
sort of measure of variability.

44. A frequency distribution has a median but usually no midscore, while ungrouped
measures have a midscore (the middle score, if the number of
scores is odd, or the average of the two middle scores, if the number is even).

45. 50 per cent of all the measures in a distribution always lie between Q3 and Q1.
In a normal (so-called "bell-shaped") distribution, 68 per cent of all cases
lie within one standard deviation of the mean.

46. The arithmetic mean is the most reliable measure of central tendency, the
mode the least reliable; the standard deviation is the most reliable measure
of variability, the range the least reliable.

47. The standard deviation is a linear distance along the base line of a frequency
distribution.

48. When correlation is positive, high scores on one test tend to go with high
scores on the other test, while low scores tend to go with low scores. This is
a direct relationship. When correlation is negative, high scores on one test
go with low scores on the other, and vice versa. This is an inverse relationship.

49. Spearman derived the formula for the rank-difference coefficient of correlation,
rho, while somewhat earlier Pearson had derived the basic formula
for r.

50. Ranks denote order only. We do not know how high or low a score the
person who obtained a certain rank may have had. Scores tell how many
points the testee earned, that is, the magnitude of his achievement.


APPENDIX 


F 


Publishers of Standardized Tests 


The following list includes every test company for whom five or more tests are 
indexed on pages 1100-1106 of Oscar K. Buros (Editor), The Fourth Mental 
Measurements Yearbook. Highland Park, New Jersey: Gryphon Press, 1953. 

The number of tests covered in the 1953 yearbook is shown in bold-face type 
following each address. An asterisk (*) preceding the name indicates that the 
company issues "catalogues devoted entirely or in large part to tests" (Buros,
page 1100).

*Acorn Publishing Co., Inc., Rockville Centre, New York (24) 

*Australian Council for Educational Research, 147 Collins St., Melbourne, C.1,
Australia (15) 

Benton Review Publishing Co., Inc., Fowler, Indiana (11)

*Bureau of Educational Measurements, Kansas State Teachers College of Emporia, 
Emporia, Kansas (19) 

*Bureau of Educational Research and Service, State University of Iowa, Iowa City, 
Iowa (14)

*Bureau of Publications, Teachers College, Columbia University, New York 27,
New York (11) 

*California Test Bureau, 5916 Hollywood Blvd., Los Angeles 28, California (29) 

Center for Psychological Service, George Washington University, Washington, D. C.
(5)


College Entrance Examination Board, 425 W. 117th Street, New York 27, New 
York (18) 


*Cooperative Test Division, Educational Testing Service, Princeton, New Jersey (59)
Division of Educational Reference, Purdue University, Lafayette, Indiana (6)
Educational Records Bureau, 21 Audubon Ave., New York 32, New York (10)

*Educational Test Bureau, Educational Publishers, Inc., 720 Washington Ave., S. E.,
Minneapolis, Minnesota (30)
Educational Testing Service, Princeton, New Jersey (49) 


*C. A. Gregory Co., 345 Calhoun St., Cincinnati 19, Ohio (5) 
*George G. Harrap & Co., Ltd., 182 High Holborn, London W.C.1, England (11) 
*Houghton Mifflin Co., 2 Park St., Boston 7, Massachusetts (14) 
Joint Committee on Tests, 132 W. Chelton Ave., Philadelphia 44, Pennsylvania (7) 
National League of Nursing Education, Inc., 2 Park Ave., New York 16, New
York
*Ohio Scholarship Tests, Ohio State Department of Education, Columbus, Ohio (29) 
Personnel Research Institute, Western Reserve University, Cleveland, Ohio (9) 
*Psychological Corporation, 522 Fifth Ave., New York 18, New York (54) 
P nt Center Press, 1275 New Hampshire Ave., N. W., Washington 
6, D. C. (6)
Psychometric Affiliates, Box 1625, Chicago 90, Illinois (5) 
TE School Publishing Company, 509-513 North East St., Bloomington, Illinois 
12 1 
*Science Research Associates, Inc., 57 West Grand Ave., Chicago 10, Illinois (37)
*Sheridan Supply Co., P. O. Box 837, Beverly Hills, California (9) 
Turner E. Smith & Co., 441 West Peachtree St., N. E., Atlanta 3, Georgia (8) 
*Stanford University Press, Stanford, California (13) 
State High School Testing Service for Indiana, Purdue University, Lafayette,
Indiana (33) 
Steck Co., Austin 1, Texas (6) 
*C. H. Stoelting Co., 424 North Homan Ave., Chicago 24, Illinois (5) 
*University of London Press, Ltd., Little Paul's House, Warwick Square, London
E.C.4, England (12)
*Vocational Guidance Centre, 371 Bloor St., W., Toronto 5, Canada (12) 
*World Book Company, 313 Park Hill Ave., Yonkers-on-Hudson 5, New York (47)


Author Index 


Adams, Eunice, 204 

Adams, Jessie E., 363 

Adkins, Dorothy C., 55, 162, 191, 245 
Allen, Mildred M., 225 

Allport, Gordon W., 49, 176, 190, 244, 421 
Anastasi, Anne, 219, 348 

Anderson, C. J., 305 

Anderson, Gordon V., 371 

Anderson, Harold A., 195, 288 
Anderson, Scarvia B., 152, 326 
Arbuckle, Dugald S., 371 

Aristotle, 8, 9, 14, 20 

Arkin, Hubert, 273 

Arnold, Dwight L., 143 

Arthur, Grace, 422 

Ashbaugh, E. J., 41, 123 

Ashburn, Robert R., 193 

Asher, E. J., 277 

Avent, Joseph E., 139 

Ayres, Leonard P., 38, 39, 45, 115 


Bain, Alexander, 52 

Baker, Arthur O., 168 

Baker, Harry J., 330, 333, 349

Bamberger, Sister Clara Francis, 254 

Barnes, Harry Elmer, 10 

Barr, Arvil S., 15, 21, 25, 127, 378, 421

Barrett, E. R., 182 

Barton, W. A., Jr., 174 

Bass, Bernard M., 421 

Bayless, Ernest E., 174 

Beard, Charles A., 4 

Beauchamp, W. L., 168, 171, 177

Bebell, Clifford, 366 

Beck, Roland L., 171 

Bedell, Ralph C., 174 

Beggs, V. L., 407 

Benjamin, A. Cornelius, 14 

Bennett, George K., 134, 245, 371, 419 

Benson, Arthur L., 398 

Berdie, Ralph F., 219 

Berkshire, James R., 365 

Betts, Emmett Albert, 345 

Betts, Gilbert L., 182 

Billett, Roy O., 348, 353, 354, 358, 362 

Billings, Josh, 370 

Binet, Alfred, 30, 31, 32, 33, 34, 35, 48, 51, 
52, 53, 55, 57, 108, 109, 115, 127, 215, 
227, 229, 280, 282, 284, 290, 351, 418 


Bingham, Walter V., 350 

Bixler, Harold H., 130 

Black, Max, 14 

Blair, Glenn Myers, 331, 334, 345 

Bledsoe, Joseph C., 371 

Blin, 32 

Blommers, Paul, 22 

Bobbitt, Joseph M., 341 

Book, William F., 321 

Bordin, Edward, 144 

Borgersrode, Fred von, 221-224 

Boring, Edwin G., 6, 15, 25, 32 

Boss, Mabel E., 406 

Bowles, Frank H., 194 

Boyd, Gertrude, 345 

Boyer, Phillip A., 213, 361, 406 

Boynton, Paul L., 59, 229, 281 

Brandenburg, G. C., 49

Brenner, Benjamin, 325, 326 

Bridges, Claude F., 246 

Brinton, W. C., 247, 271, 273 

Brown, Clara M., 183 

Brown, Edwin J., 402 

Brown, F. W., 187 

Brown, Francis J., 316, 317 

Brown, G. L., 284 

Brown, Judson S., 327 

Brown, William, 123, 124

Brown, Woodrow A., 246 

Browne, Arthur D., 398 

Brownell, William A., 14, 126, 134, 142,
167, 224, 329, 341

Brubacher, John S., 15, 25

Brueckner, Leo J., 126, 336 

Bruner, Herbert B., 388 

Buckingham, B. R., 15, 43, 44 

Burbank, Luther, 299 

Burnham, Paul Sylvester, 371 

Burns, Bob, 150 

Burnside, Carolyn J., 246 

Buros, Oscar, 54, 55, 120, 121, 134, 191, 
219, 245, 464 

Burt, Cyril, 31 

Buswell, Guy T., 340 


Cadwell, Dorothy H. B., 245 
Cady, 49 

Caldwell, Otis W., 29, 38 
Caldwell, V. V., 354, 355




Calvin, Allen, 51 
Cameron, Dale C., 341 
Camp, 309 
Campbell, Pera, 340 
Cane, V. R., 327 
Cantril, Hadley, 52 
Carmichael, Leonard, 47 
Carroll, John B., 135, 219 
Carter, Ralph i^e 198, 200 
Cattell, J. McKeen, 30, 33, 52, 115 
Cattell, Psyche, 284, 285, 288 
Cattell, R. B., 52 
Chadwick, E., 38 
Chalmers, T. M., 314 
Chambers, E. G., 104
Charters, W. W., 337 
Chase, W. P., 326 
Chauncey, Henry, 300, 371 
Chave, Ernest J., 59 
Childers, Leon M., 41
Clark, Champ, 19
Clark, Edward L., 124, 348 
Clark, Willis W., 175 
Cobb, E. B., 245, 371 
Cobb, Irvin S., 19
Cochran, Roy E., ru 196, 203, 204 
Cohen, I. Bernard,
Cole, Robert D., 55 224 
Coleman, William, 245, 371 
Colton, Raymond E 273 
Compton, R. K., 321 
Conant, James Bryant, 5 
Conklin, Edmund S., 352 
Conneau, A., 163 
Conner, 310
Conrad, Herbert 5., M 52, 120, 297 
Cook, Stuart W., 25, 51, 398, 425
P ee W. 59, тва 206, 300, 327, 


65 
Coombs, Clyde H., 424 
Cooper, р. H., 398 
Cornell, Ethel L., 359 
Cornell; Francis er 304, 395, 396, 398 
Cornog, J., 168 
Courtis, Stuart A, Pk 38, 134, 219 
Cowdery, Karl м; 
Crawford, Albert amid 371 
Crawford, John R., 291 
Crespi, Leo P. ae 
C Paul F., 2 
Cronbach, Mes 23, 55, 124, 154, 162, 
180, 245, 300, 345, 416, 423, 4 
Crow, Alice, 365 
Crow, Lester D., 365 
Crowder, Norman A., 418 
Cureton, Edward E., 135, 298, 417, 420,


5 
Curtis, F. D., 174 


Dailey, John T., 398 

Daly, Joseph F., I 
Danielson, Paul. J., 

Darley, John G., dr an, 415 
Darling, W. C., 174 

Darwin, Charles, 8 




Davenport, K. S., 116 

Davis, Allison, 273, 422 

Davis, Frederick B., 22, 25, 55, 120, 134, 
135, 154, 157, 160, 162, 371, 422, 440 

Davis, Georgia, 341, 342 

Davis, R., 176 

Davis, Robert A., 25, 139, 164, 197, 348, 


Deming, W. Edwards, 60 

Deputy, E. C.. 317, the 

Deutsch, Morton, 35, 51, 398, 425 
Dewey, John, 4, 11, 25, 357
Diamond, Leon №. 1 

Diederich, Paul B., 238 

Dixon, Wilfred J., 105

Dolbear, Amos E., 352

Domas, Simeon J., 398, 418
Doppelt, Jerome E., 418, 425 
Doughton, Isaac, 16

Douglass, Harl R., 

Dressel, Раш L., ix, а, 140, 345 
DuBois, Philip H., 134 

Dunlap, Jack W., 234 
Dunlap, Knight, 37 

Dunn, Leslie C., 8 
Durant, Will, 16 

Durost, Walter N., 246, 300 
Durrell, Donald D., 334 
Dyer, Henry S., 326
Dykema, Karl W., 119 


Eaton, Merrill T., 21 
Ebbinghaus, Hermann, 30, 31, 34 
Ebel, Robert L., 191 

Edmiston, R. W., 201 

Edwards, Allen L., 105
Edwards, I. Newton, 347

Eells, Kenneth W., 278, 422 
Eells, Walter Crosby, 414 
Einstein, Albert, 7, 25 

Elliott, Edward C., 40 

Ellis, Albert, 51, 206 

Ellis, Robert S., 365

Ellis, Robert W., 398 

Ellwood, Charles A., 10
Elsbree, Willard S., 108, 415 
Elwell, Mary, 126, 336 

Engel, Thelburn D 363 
Engelhardt, N. L., Jr., 397
Engelhardt, N. L., Sr., 395 
Engelhart, Max D., 49, 134, 165 
Erickson, Clifford E., 371
Eurich, Alvin C., 165, en 377, 398 
Evans, Robert 0., 407, 4 

Eysenck, Hans J., 52 

Ezell, L. B., 191 


Fabre, Jean i 9 

Falls, J. D., 

Fan, Chung-Teh, 134 

Farley, Belmont Mercer, 402, 403 
Fechner, Gustav T., 31 
Ferguson, George A., 125 
Fernald, 47

Fernald, Grace M., 345 

Ferrell, Guy V., 88




Fiala, Silvio, 7 

Finch, Frank H., 182, 258 

Findley, Warren G., 134 

Fish, Louis J., 405 

Fisher, George, 38 

Fisher, Ronald A., 88

Flanagan, John C., 52, 134, 146, 204,
245, 274, 288, 300

Flesher, Marie A., 364 

Foley, J. D., 372 

Foley, John P., Jr., 348 

Folk, S. B., 364 

Forlano, George, 316 

Fowler, Fred M., 369 

Franzen, Raymond, 298 

Frederiksen, Norman, 246, 300, 371 

Freeman, Frank N., 59 

Freeman, Frank S., 55, 135, 352

Froelich, Clifford P., 371, 415 

Furfey, Paul H., 102


Gable, Sister Felicita, 310 

Gage, N. L., 

Galen, 47 

Galileo, 4, 6, 7, 11 

Gallup, George H., 52 

Galton, Sir Francis, 8, 30, 31, 33, 38, 40, 
48, 49, 52, 85, 86, 109, 115, 308, 348 

Gardner, Eric F., 59, 162, 300 

Garrett, Harley F., 86 

Garrett, Henry E., 105, 290 

Gates, Arthur I., 182, 357 

Gerberich, Raymond J., 22 

Gilbert, Arthur W., 365 

Gilchrist, Robert S., 415

Gillespie, F. H., 49

Gilliland, A. R., 348, 422 

Glaser, Maynard, 423 

Goddard, Henry H., 33 

Goheen, Howard W., 55, 162 

Goldenweiser, Alexander, 9, 10 

Good, Carter V., 15, 21, 127

Goodenough, Florence L., 33, 55, 162, 191

Goossen, Carl V., 422 

Gordon, Hans C., 22, 406 

Gordon, W. E., 195 

Grambs, Jean E., 414 

Gray, J. Stanley, 14 

Green, Bert F., Jr., 206 

Greenberg, Jacob, 183, 188 

Greene, H. A., 176, 177 

Gregory, C. A., 171, 173 

Grimes, James W., 144 

Grossnickle, Foster E., 126 

Guiler, Walter Scribner, 330, 331

Guilford, J. P., 105, 123, 135, 144 

Gulliksen, Harold, 25, 55, 134, 135 


Hagen, Elizabeth P., 418 
Haggard, Ernest A., 278 
Haggerty, Lida Harmar, 2 
Haggerty, M. E., 382 

Haines, Millicent, 60 

Hall, G. Stanley, 49, 50, 52, 404 
Hall, Wilbur, 299 

Hangen, 243 




Harris, Albert J., 294 
Harry, David P., Jr., 168, 189 
Hart, 49 

H., 406 


Hartshorne, 47, 48
Hartung, Maurice, 327
Hawkes, Herbert E., 150, 156, 165, 180, 
185, 186, 197 
Hawkinson, Mabel J., 336 
Healy, William, 35 
Heil, Walter G., 110, 111, 286, 287 
Heim, Alice W., 327 
Heimann, Robert A., 371, 422 
Heinis, H., 288 
Heisenberg, W., 132 
Helmholtz, Hermann, 8, 31 
Henry, Lyle K., 311 
Henry, Nelson B., 191, 206 
Hensley, Iven H., 139, 164, 197 
James B., 1 


д 21 
Hevner, Kate, 174 
Hildreth, Gertrude H., 54, 58, 130, 288, 
333, 337, 338, 345, 366 
Hilgard, Ernest R., 327, 373
Hilkert, Robert N., 245 
Hill, Seem E., 175, 407, 408, 409 
Hoglan, 


Hollingshead, Augustus B., 366 
Hollingsworth, Leta S., 284 
Holzinger, Karl J., 418 
Hopkins, Earnest Martin, 28 
Hopkins, Mark, 116 

Horn, Alice, 110, 111, 286, 287 
Horst, Paul, 185 

Howerth, I. W., 8 

Hubbard, Henry D., 247 
Hudelson, 43 

Huey, 53 
Hull, Clark L., 54, 351, 352 
Hulten, C. E., 43 
Hunnicutt, Clarence W., 323 
Hurlock, Elizabeth B., 321, 322, 326 
Huxley, Julian, 7 

Hyde, M. F., 360 


Irion, Arthur I., 327
Irwin, J. O., 57
Ivins, Lester S., 401 


Jackson, Robert W. B., 125 
Jacobs, Robert, 135, 162, 191, 246 
Jahoda, Marie, 25, 51, 398, 425 
James, William, 15 

Jefferson, Thomas, 347 

Jesus, 347 

Jevons, W. Stanley, 30, 31 
Johnson, Bess E., 311 

Johnson, Franklin W., 39 
Johnson, Palmer O., 25, 421 
Jones, Edward Safford, 200 
Jones, Harold E., 22, 310, 312 
Jones, Lyle, 418 

Jones, Vernon, 47 




Jordan, A. M., 162, 191, 245 
Judd, Charles H., 368 


Kandel, Isaac L., 38, 193, 400 
Katz, David, 9 
Kavruck, Lc 55, 162 
Kay, Marjorie E., 337
Kelley, Truman L., 52-54, 56, 115, 116,
124, 132, 161, 170, 229, 273, 290, 296, 
29; 


Kelley, V. H., 176 

Kelvin, Lord, 7 

Kemble, Edwin C., 25 

Kendall, Maurice G., 105 

Kepler, Johann, 7 

Keys, Noel, 22, 311, 312, 330 

Khayyám, Omar, 4 

Kilpatrick, William H., 17, 18, 356

King, Harold V., 422 

King, Ronold, m 

Kinney, L. B., 

Kirkpatrick, dones je 310 

Kitch, Loran V. 

Kostick, Max M., 2, 152, 206 

Kraeplin, Emil, 3 

Kruglov, Toren 152 

Kuder, G. F., 51, 124, 129, 219, 416, 417, 
419, 421 

Kugle, 310 

Kuhlmann, Frederick, 2 182, 258, 288 

Kulp, Daniel H., 22, 3 


Laird, Donald A., 

Lampson, Edna É., 321 

Langmuir, Charles R., 103, 134

Langmuir, Irving, 133 

Lawler, Eugene S., 249 

Learned, William S., 350

Leary, B. E., 141 

Lee, J. Murray, 164, 165, 180, 197

Lee, Richard E 

Lefever, D. Waits, 371 

Lehman, Harvey C., 51 

Leiter, Russell , 422 

Lennon, Roger T., 278, 287 

Lentz, Theodore F., Jr., 47, 154

Leonard, E. A., 

Leonard, J. Paul, 331, 398

Leonard, Sterling A., 171 

Lev, Joseph, 105 

Lewin, Lillie, 246 

Ligon, Ernest M., 228, 230 

Lincoln, Abraham, 375, 415 

Lindquist, E. F., 22, 25, 55, 58, 105, 119, 
121, 127, 135, 139, 142, 150, 154, 156, 
160, 162, 165, 180, 185, 186, 191, 197, 
206, 245, 27, 295, 300, 397, 345, 346, 
371, 417, 4 

Lindvall, Cel M, 398 

Lindzey, Gardner, 176, 190, 244, 421 

Lockhart, Aileene, 22 

Locy, William A., 8

Longstaff, H. P., 311 

Lord, Frederic, 425 

Lorge, Irving, 25, 135, 152, 278 

Lundholm, H. I 108, 1 178 




McAllister, Jane E., 22 

McCall, William A, 17, 44, 53, 54, 227, 
279, 395, 361, 362, 377 

McClelland, David C., 327 

McConn, Max, 28 

McConnell, James, 51 

McCullough, Constance M., 345

McGeoch, John A., 327

McGregor, J. B., 195 

McKinney, H. T. 

McNamara, Walier A 135, 160, 162, 181, 


McNemar, Quinn, 32, 52, 91, 108, 110,
111, 278, 284, 286, 287, 4 429 
Mackenzie, Gordon N., 366 
MacKinnon, Donald W., 327 
Madison, James, 402 
Maller, Julius ard TR 
Malthus, Thomas R., 
Mann, C. R., 150, 156, 165, 180, 185, 186,
197 
Mann, Horace, 29, 38, 45 
Mann, William A., 345
Marconi, Guglielmo, 11
Marston, William M., 49
Martin, Abe, 19
Massey, Frank J., Jr., 105
Mather, Kirtley F., 6 
Mathews, C. O., 49 
Maxfield, Francis N., 57, 58 
Maxwell, James Clerk, 11 
May, Mark A., 47, 48
Мате Joseph. R., 9 
Melby, Ernest O., 365 
Mendel, Gregor J., 8 
Menninger, Karl A., 299
Merrill, Maud A., 34, 282, 288-290, 351 
Messenger, Helen R., 304, 406 
Meumann, Ernst, 31 
Meyer, George, 324 
Meyer, Max, 39 
Miller, W. S., 287 
Mills, Lewis H., 168
Miner, James B., 51
Mitchell, Claude, 319
Modley, Rudolph, 273
Mollenkopf, William G., 371 
Monroe, Walter S., 43, 44, 49, pe 58, 59, 
113, 114, 165, 198, 200, 297, 3 
Moody, Caesar B., 59 
Moore, Bruce V., 31, 54 
Morgenstern, Oskar, 423 
Morrisett, L. N., 407 
Mort, Paul R., 394-396 
Mosier, Charles I., 180 
Mowrer, O. Hobart, 327
Müller, Johannes, 8
Murphy, Gardner, 348 
Myers, George E., 368 
Myers, M. Claire, 180 


Neale, M. G., 404 

Nelson, M. J., 183

Nemzek, Claude L., 284 
Neumann, John Von, 423 
Newcomb, Theodore M., 327




Newens, Lyndall Fisher, 196 
Newland, T. Ernest, 337 
Newman, Sidney H., 341 
Nixon, Belle M., 29, 152, 206 
Noll, Victor H., 311 

Norton, John K., 249 
Norvell, Lee, 321 

Nowlis, Vincent, 327 


Odell, C. W., 59, 105, 135, 162, 191, 206 

Oden, Melita H., 366 

Ogburn, William F., 9, 10 

Olson, Helen F., 197 

Omwake, K. T., 178 

Orata, Pedro D., 376 

Orleans, Jacob S., 361 

Osler, Sir William, 338 

Otis, Arthur S., 36, 110, 111, 220, 221, 286, 
287, 319 

Otto, Henry J., 348, 358, 359, 364, 365 

Oxtoby, Toby, 372 


Pace, C. Robert, 374, 377, 398, 399 

Panlasigui, Isidoro, 315 

Parten, Mildred B., 52 

Partridge, E. DeAlton, 360 

Paterson, Donald G., 35, 37

Payne, William H., 16 

Pearson, Karl, 8, 31, 33, 48, 85, 86, 102, 
104, 109, 433, 435 

Pease, Glenn R., 313, 314 

Peattie, Donald Culross, 9 

Peel, E. A., 327 

Peters, Charles C., 243 

Peterson, Joseph, 59 

Phillips, Alexander J., 156 

Pieper, C. J., 168, 171, 177 

Pintner, Rudolph, 35, 37, 44, 59, 287 

Planck, Max, 7, 12, 26, 132 

Plato, 9, 14, 351 

Poffenberger, Albert T., 299 

Postman, Leo J., 327 

Pressey, Sidney L., 49, 129, 312, 327, 340, 
363, 364, 366

Price, Helen G., 180 

Pythagoras, 9 


Quetelet, 10 


Raths, Louis E., 141, 142 
Reavis, W. C., 398 
Reeder, Ward G.
Reeve, Ethel B.
Remmers, Hermann H., 49, 116, 206, 312,
313
Rice, Joseph M., 20, 21, 38, 45, 57, 59 
Richardson, Helen M., 3 
Richardson, Marion W., 124, 219, 416 
ene d Bord ghe 
Riley, John L., 45, 46, : 
Rinsland, Henry Daniel, 165, 168, 171,
173, 202
Roens, Bert A., 371, 415 
Rogers, Carl R., 370, 371 
Rorschach, Hermann, 421 
Roseborough, Mary E., 421 




Ross, C. C., 133, 240, 311, 317, 319, 
oss, C. C , 306, 7, 31 


Rothney, John W. M., 371, 372, 415, 422 

Rousseau, Jean Jacques, 

Ruby, Lionel, 26 

Ruch, G. M., 54, 59, 156, 163-165, 170, 
193, 195, 294 

Rugg, Harold O., 17, 59 

Rulon, Phillip J., 111, 134, 278, 300, 417, 
419, 420 

Russell, Bertrand, 5, 6, 11, 12, 19, 26 

Russell, David H., 327, 373 

Ryan, T. M., 182 

Ryan, W. Carson, Jr., 194 

Ryans, David G., 31, 246 


Sacks, Elinor L., 133 

Sarton, George, 26 

Saucier, W. A., 24 

Saupe, Joe L., 398 

Sauvain, Walter Howard, 360 

Scates, Douglas E., 15, 21, 24, 60, 127, 133, 
249, 264, 206, 338 

Schrader, William B., 103, 371 

Schrammel, H. E., 182 

Schutte, T. H., 22 

Schwab, Joseph J., 135 

Schwiering, O. C., 

Scott, Walter D., 48 

Scott, William O. N., 415 

Scruggs, Sherman D., 330 

Seashore, Carl, 37, 54 

Seashore, Harold G., 134, 245, 300, 371,
419
Segel, Daid, en cl 294 
Segui ouar 
Sells, Saul B., 398 
Selover, Margaret, 135, 162, 191, 246 
Seyfert, Warren C., 415 


Shaffer, Laurance F., 120 
Shakespeare, William, 121 


Sharp, George, 421 
Shaw, Duane C., 38 
Sherman, N. H., 174 
Shih, Hu, 4 

Shore, 309 

Siceloff, L. P., 168, 177 
Simpson, Robert G., 346 


Sims, Verner M., 196, 204, 374 

Singer, Arthur, 421 

Smart, Harold R., 5, 14 

Smeltzer, C. H., 312 

Smith, B. Othanel, 5, 11, 48 

Smith, Eugene R., 117, 140, 141, 374, 415,
422

Sones, W. W. D., 168, 189 

Spanney, Emma, 177 

Spaulding, Geraldine, 183, 188, 246 

Spear, Mary Eleanor, 249, 253, 272, 273 

Spearman, Charles E., 8, 31, 52, 54, 99, 
123, 194, 435 

Spence, Ralph B., 209, 210 

Stalnaker, John M., 135, 139, 192, 194, 
195, 203-206 

Stanley, Julian C., 55, 118, 124, 162, 206, 
244, 284, 300, 371, 417, 420, 422, 452 




Starch, Daniel, 40, 41, 53 
i 246 


WINE. Wil, 52, 55, 191, 246 
Stern, Wilhelm, 3 

Sternberg, Jack 1 419 

Stevens, S. Smith, 26

Stoddard, George D., 54, 168, 225, 287
Stone, C. W., 39, 168, 331

Stouffer, SR 24 

Strang, Ruth M., 345, 372 

Strayer, George D., 395

Strong, E. K., Jr., 51

Strunk, Mildred, 52 

Stuit, Dewey B., 346

Stutsman, Rachel, 289 

Sueltz, Ben A., 327 

Super, Donald E., 58, 246, 372
Sutherland, A. A. 

Sykes, Gries M pm 

Symonds, Percival, 48, 54, 165
Swan, J. N., 276 


Taba, Hilda, 305 

Tallmadge, Margaret, 323 

Terman, Lewis M., 30-34, 53, 108, 110, 
1170170, 282, 279, 280, "289-984, 286- 
288, 290, 299, 345, 351, 363, 366 

Terry, Paul W., 323, 324

Thelen, Herbert a з 94 421 

Thisted, M. N., 

"Thompson, DUE G., 323 

Thompson, Loring M., 273 

Thorndike, Edward Th 17, 30, 33, 34, 38, 
39, 43, 51-55, 113, 421, 135, 237, 260; 
279, 295, 323, 376 

Thorndike, Robert d 906, 284, 300, 400, 


"Thurstone, Don 37, 38, 49, 52, 59, 144, 
221, 281, 289, 

Thurstone, Thelma Gwinn, 28

Tiedeman, David V., 300, 371, 308, Ais. 419 

Tiegs, Ernest W., 175, 5, 358

Tilton, John W., 93 

Tinkelman, Sherman, 152 

Tippett, L. H. C., 105 

Todhunter, Isaac, 19 

Torgerson, Warren S., 206

Townsend, Agatha, 135, 162, E 210 372 

"Travers, Robert M. W. 135, 162, 

Traxler, Arthur E., 103, 127-129, 138, 162, 
183, 191, 195, 211, 243, 246, 300, 334, 
341, 344, 345, 372, 3m, 39 8,4 

Trow, William Clark, 

Troyer, Maurice E., b eun 

Tsao, Fei, 208 

Tucker, A. C., 240 

p Austin H., 22, 311, 312, 358, 360, 

362 


Turrell, Mela Moi 371 


Tyler, F. 
Tyler, Ralph W, 2 21, 114, 115, 117, 140, 
141, 827, 338, 346, 374, 401, 415, 422 


Tyson, Robert, 9 




Updegraff, Ruth, 225 
Upton, Clifford B., 119


Vaughan, Kenneth W., 245 
Vernon, M. D., 
Vernon, Philip E., 185, 176, 190, 221, 244,


Volker, 47 
Votaw, David F., Sr., 154 


Waldrop, Babet 5., 417 
Walker, Helen M., 10, 60, 85, 90, 105, 378,


Warner, Charles Dudley, 37

Washburne, John Noble, 249 

Washington, George, 20 

Watson, Goodwin, 48, 49, 422, 423 

Watts, Winifred, 40 406 

Weber, E , 31 

Wechsler, David, 55, 215, 279, 282, 289,
418, 492 

Weidemann, Charles C., 194, 196, 198, 
200, 203, 204 

Weitzel, Henry I., 371 

Weitzman, Ellis, 135, 160, 162, 181, 191

Wesley, E. B., 182 

Wesman, Alexander G., 101, 103, 134, 245, 
971, 419 

Westaway, F. W., 7, 12, 13, 132 

Wheeler, L. R., 277 

White, Clyde W., 306 

White, Emerson E., 29, 30

White, Hubert B., 314 

Whitehead, Alfred North, 4, 8, 9 

Whitney, Frederick E, 26, 399 

Wilbur, Ray Lyman, 338

Wilder, М,, 311 

Wilds, Elmer Harrison, 362


, Paul, 51, yo» 366 

Wolfle, Dael, 372

Wood, Ben D., 54, 276, 350

Woodworth, Robert S., 49 

Woody, Thomas, 356 

Worcester, B AT 201, 246 

Wright, W. H. E., 174 

Wrightstone Ј. Wayne, па it 187, 200 
Wrinkle, William L., 410, 415

Wriston, Henry M., 376

Wundt, Wilhelm, 30, 38 

Wyndham, Harold S., 358, 359


Young, Kimball, 59
Yule, G. Udny, 105 


Ziegfeld, Edwin, 377
Zook, George F., 382 


Subject Index 


Abilities of Man, Spearman, 54 
Ability, defined, 276 
Ability groups, 357-65 
acceleration, 363-65 
arguments for and against, 358 
continuous promotion, 365 
experimental evidence, 358-60 
retardation, 363-65 
technique of grouping, 361-63 
Abnormal psychology, France and, 31-34
Acceleration, 363-65 
Accomplishment quotient (AQ), 298 
Achievement: 
intelligence and, comparing, 296-99 
scores, intelligence and, combining, 298-99
Achievement tests: 
defined, 23 
diagnostic, development, 44 
history, 38-46 
improved examinations, 45-46 
progress before 1918, 38-39 
progress since 1918, 43-45 
unreliability of school marks and ex-
aminations, studies in, 39-43
intelligence vs. aptitude, 418-19 
norms, use in interpreting scores on,
290-96 
objective type, development, 44 
practice, development, 44 
specific type, development, 44
standardized and nonstandardized, ad- 
vantages and limitations, 216-17 
survey type, development, 44 
validation, 111-21 
criticisms, 113-14 
curricular versus statistical, 111-13 
direct vs. indirect methods, 115-17 
item analysis, 117-19 
standard tests, judging, 119-21 
Tyler’s suggestions, 114-15 
Activity movement, 35 
Administration: 
ease of, usability, 127-28 
procedure for, 228-30
school, measurement in, function, 21-22 
test, 225-30 
time for, 225-27 
who should administer, 227 


Administrators, school, records and re-
ports for, 240-43


Age:
chronological, see Chronological age 
educational, see Educational age 
increase, criterion of intelligence, 108 
mental, see Mental age 
promotion, see Promotion age 
subject age, see Subject age 

Agencies of public information, ordinary,
402-04
newspapers, local, 402-03 
student publications, 403-04 
Allport-Vernon-Lindzey Study of Values, 
176, 190, 421 
Alternative-response tests, 174-79 
advantages, 174-75 
construction, rules and suggestions for, 
178-79 
definition, 174 
illustrations, 175-78 
limitations, 174-75 
America, applied psychology and, 33-34 
American Council on Education Psycho- 
logical Examination, 326 
American Journal of Psychology, 52 
American Psychological Association, 19,
36, 416 
Annual reports, public relations, 404 
Answer keys, preparing, 158 
Application, ease of, usability, 129-30 
Applied psychology, America and, 33-34 
Applied sciences, measurement in, 11-12 
Appraising Vocational Fitness, Super, 55 
Aptitude Testing, Hull, 54 
Aptitude tests: 
achievement vs. intelligence, 418-19 
special, 37-38 
AQ, see Accomplishment quotient 
Army Alpha tests, 16, 36-37, 108 
Army Beta tests, 37 
Arthur Adaptation of Leiter International 
Performance Scale, 422 
Astronomy, measurement in, 6 
Ayers educational index, 115 


Bar graphs, 262, 263, 264 
Barrett-Ryan Literature Test: Silas Mar- 
ner, 182 




Bible, testing bred m Al A 
Bibliography of Mental Tests and Rating
Scales, Hildreth, 54
Biological Sciences, measurement in, 7-9 
Books, important, 53-55 


CA, see Chronological age 
California Achievement Tests, Advanced 
Battery, 175-76, 254-55, 256
California Short-Form Test of Mental Ma-
turity, 110, 287 
Capacity, defined, 276-77 
Central tendency, concept of test data, 69, 
73-74 
measure, 75 
Character Education Inquiry, 47 
Character measurement, history, 46-52
beginnings, 46 
development, 46-48, 51-52 
interview, 51 
questionnaire, 49-51 
rating scales, 48 
Children’s Apperception Test (CAT), 421 
Chinese civilization, testing devices in, 27-
28


Chronological age (CA), 279-80 
Circle graphs, 263 
Clapp-Young self-marking tests, 129 
Classification: 
ability groups, 357-65 
acceleration, 363-65 
arguments for and against, 358 
continuous promotion, 365 
experimental evidence, 358-60 
individual and group instruction, 357 
retardation, 363-65 
technique of grouping, 361-63 
activity movement, 356-57 
human variability, 347-56 
group differences, 348-50 
individual differences, 351-52, 353-56 
problem, 347-48 
trait variability, 352-53 
test results, 61-69 
rank order, 61-62 
Class interval, selecting, frequency distri- 
bution, 64 
Coefficient of correlation, see Product- 
moment coefficient of correlation 
Coherency, criterion of intelligence, 109 
Cole-von Borgersrode Scale for Rating 
Standardized Tests, 222-24 
College Board Review, 192 
College Entrance Board Examinations, 
192, 195, 204, 305 
Colorado experiment, reporting to par- 
ents, 410-11 
Column diagram, 258-59 
Committee on Standards for Graphic Pre- 
sentation, 272 
Completion tests, 170-74 
advantages, 170 
construction, rules and suggestions for, 
172-74 
definition, 170 




Completion tests (Cont.) : 
illustrations, 170-71
limitations, 170 
Computations, concepts versus, quantita- 
tive data, 69-75 
Concentration, objective of instruction, 
145 


Concepts: 
computations versus, quantitative data, 
69-75 


co-relationship or concomitant varia- 
tion, 85-90 
Concomitant variation, concept, 85-90
Concurrent validity, 417 
Confidence, stage in use of tests, 57-58
Congruent validity, 417 
Construction and Analysis of Achievement 
Tests, Adkins et al., 55
Construction of tests, see Test construction 
Content validity, 417 
Continuous promotion, 365 
Co-operation, objective of instruction, 146 
Cooperative Achievement Tests, 112, 140, 
291 
Cooperative English Test, 171, 422 
Cooperative French Test, Junior Form
1936, 183, 188-89 
Cooperative Plane Geometry Test, Re- 
vised Series Q, 177 
Cooperative Solid Geometry Tests, 178 
Cooperative Study of Secondary School 
Standards, 374-75, 376, 378-80, 
383, 384, 388, 390, 394, 395
Cooperative Test of Social Studies Abili- 
ties, 182, 184, 187-88 
Co-relationship, concept, 85-90 
Correlation, 85-90 
coefficient, see Product-moment coeffi- 
cient of correlation 
rank, 95-98 
Cost, usability of measuring instrument,
130

Creativeness, objective of instruction, 144 
Criterion problem, 417-18 
Critical caution, stage in use of tests, 58 
Crude scores, see Raw scores 
Curiosity, stage in use of tests, 57 
Curricular validity, 101 

statistical versus, 111-13 


Davis-Eells Games for Grades I-VI, 422 
Deciles, 75 
Decision-making, information and, 423-24 
Derived Scores, 288-90 
intelligence quotient, see Intelligence 
quotient 
raw scores versus, 278-79
Deviation: 
quartile, see Quartile deviation 
standard, see Standard deviation 
Diagnosing Personality and Conduct, Sy-
monds, 48
Diagnosis, 328-45 
causes of errors, locating, 337-41 
individuals needing, locating, 332-33 
levels, 332




Diagnosis (Cont.) : 
nature of difficulty, locating, 333-37 
nature of educational, 328-30 
preventive, 345 
problem, 328-31 
remedial procedures, 341-45 
techniques, 332-45 
value in education, 330-31 
Differential Aptitude Tests, 418-19 
Differential prediction, 419 
Difficulty, measure, item-analysis, 440 
Discrimination, measure, item-analysis, 
437-40 
Discrimination table, item-analysis, 447-
51


Distribution, frequency, see Frequency 
distribution 

Draft, preliminary, preparing the test, 
147-48, 149 


EA, see Educational age 
Education: 
meaning of, 13-16 
measurement in, 13-25 
achievement tests, history, 38-46 
character, personality and interest 
tests, history, 46-52 
function, 21-22 
historical development, 27-58
intelligence tests, history, 30-38 
place of, 16-18 
publications, important, 52-55 
recent tendencies, 56-58 
types, 23-25 
views of, 17
reputation in, 19-20 
research in, 19, 20 
rhetoric in, 19 
three R’s in, 18-21
Educational, Psychological, and Personality 
Tests of 1933-35, Buros, 54
Educational Administration and Supervi- 
sion, 53 
Educational age (EA): 
educational quotient versus, 290-91 
limitations, 291-92
uses, 291
Educational and Psychological Measure-
ment, 53
Educational Guidance, Kelley, 53
Educational Measurement, Lindquist, 22,
55 
Educational Measurement, Starch, 53 
Educational program, evaluating, 384-94 
Educational quotient (EQ): 
educational age versus, 290-91 
limitations, 292-93 
use, 292-93 
Educational Records Bureau, 243 
Educational Test Bureau, 257
Eight-Year Study of the Progressive Edu-
cation Association, 117, 140, 374 
Elementary schools, principles of evalua- 
tion for, 380-81 
England, statistical methods and, 31 
EQ, see Educational quotient 




Error: 
causes of, diagnosis, 337-41 
measures of, 101-03 
measurement, 101, 102-03 
sampling, 101, 103 
technique, 101, 102 
Essay examinations, 192-205 
advantages, 196-97, 200-01
grading by sorting, 204-05 
improving, suggestions for, 197-205 
advantages, special, 200-01 
construction, 198-99
scoring, 202-04 
use, 198-99 
limitations, 193-95 
preparing students to take, 201-02 
reliability, 193-95, 196 
usability, 195, 196 
validity, 193, 196-97 
Essentials of Psychological Testing, Cron-
bach, 55
Evaluation: 
in guidance, 367-71 
co-operative venture, 370-71 
importance, 367-68 
meaning, 367-68 
measurement in, 369-70 
of schools, 373-98 
Cooperative Study of Secondary
School Standards, 378-80 
difficulty, 376-77 
educational program, 384-94 
importance, 375-76 
index of variation, 397-98 
measurement and, 373-75 
organization and plant, 395-97 
philosophy of school, 383-84 
principles of, general, 380-83
problem, 373-80 
teaching efficiency, 377-78 
tests in, use, 379 
qualitative technique, 420-22 
semi-quantitative technique, 420-22 
test, 159-62 
Every Pupil Test in Physics, 187
Examinations: 
essay, see Essay examinations 
school: 
improved, 45-46
unreliability, studies in, 39-43
Exhibits, school, public information, 413
Expectancy tables, 100-01 


Experimental psychology, Germany and, 
30-31 


Factor analysis, 418 
Fourth Mental Measurements Yearbook 
(1953), Buros, 55, 221 
classification of tests in, 218-219 
Forty-Fifth Yearbook of the National So- 
ciety for the Study of Education, 422 
France, abnormal psychology and, 31-34 
Frequency distribution or table, 62-69 
form, 66-67 
graphical representation, 258-64 
bar graphs, 262, 263, 264 




Frequency distribution or table (Cont.) :
graphical representation (Cont.) :
circle graphs, 263
column diagram, 258-59
frequency polygon, 259-60
histogram, 258-59
pictographs, 263
pie graphs, 263
skewed curves, 262-63
smooth curve, 260-62
symmetrical curves, 262-63
typewriter graphs, 263
which graph is best?, 263-64
making, 64-66
scattergram or scatter diagram, 67-69
two-way, 67-69
Frequency polygon, 259-60
Frequency table, see Frequency distribu-
tion


George Washington University English
Literature Test, 1
Germany, experimental psychology and,
30-31
Gestalt school of psychology, 9, 16-17
Grade norms:
limitations, 293-94
use, 293-94
Grading, see Scoring
Graphs, 247-73
constructing, suggestions for, 271-73
distributions, two or more representing,
264-71
central tendencies and variabilities of,
269-71
central tendencies of, 267-69
entire distributions, 264
frequency, see under Frequency distri-
bution
percentile curves, 265-66
polygons, use, 264-65
single distribution, representing,
bar graphs, 262, 263, 264
circle graphs, 263
column diagram, 258-59
histogram, 258-59
pictographs, 263
pie graphs, 263
polygon, 259-60
skewed curves, 262-63
smooth curve, 260-62
symmetrical curves, 262-63
typewriter graphs, 263
which graph is best?, 263-64
record of an individual, representing,
254-58
profiles for series of subjects, 254-58
profiles of single subject, 254
value, 247-54
attention getting, 247-49
points clarified, 249
retention aided, 249-54
Gregory Tests in American History, 171
Group:
ability, see Ability groups
differences, 348-50
instruction, 357
Grouped frequency distribution, 64, 65
Guidance, evaluation in, 367-71
co-operative venture, 370-71
importance, 367-68
meaning, 367-68
measurement in, place of, 369-70


Heinis Mental Growth Units, 288
Higher institutions, principles of evalua-
tion for, 382-83
Histogram, 258-59
Holzinger-Crowder Uni-Factor Tests, 418
Homogeneous groups, see Ability groups
How to Measure in Education, McCall, 54
Hudelson Scale, 43
Human variability:
educational significance, 347-56
group differences, 348-50
individual differences, 351-52
educational provisions for, 353-56
nature, 347-56
problem, 347-48
trait variability, 352-53


IBM General Purpose Answer Sheet, 159
Improvement of the Written Examination,
Ruch, 54
Index of variation, evaluation of schools,
397-98
Individual:
differences, 351-52
educational provisions for, 353-56
instruction, 357
profile chart, 239
Information, decision-making and, 423-24
Initiative, objective of instruction, 144-45
Instruction:
individual and group, 357
measurement in, 303-425
classification and promotion, 347-65
diagnosis, 328-45
evaluation in guidance, 367-71
evaluation of schools, 373-98
function, 21-22
motivation and practice in testing,
303-27
public relations, 400-15
trends, present, 416-24
objectives, categories, 141-42
concentration, 145
co-operation, 146
creativeness, 144
initiative, 144-45
interest, 145
judgment, 145-46
motivation, 145
outcomes, provision for evaluating, 141-
46
relation of measurement to motivation
in, 3 6
teaching efficiency, evaluating, 377-78
Intelligence: 
achievement and, comparing, 296-99 




Intelligence (Cont.) : 
meaning, 108
scores, achievement and, combining,
298-99
Terman criteria, 108-09 
Intelligence quotient (IQ): 
advantages, 284-85 
computation, 281-83 
equating values, table for, 286 
interpretation, 283-84 
limitations, 285-88 
mental age vs., 279-80 
Intelligence tests: 
achievement vs. aptitude, 418-19 
defined, 23 
history, 30-38 
abnormal psychology, France and, 
31-34 
applied psychology, America and, 
33-34 


children of necessity, 34-37 
experimental psychology, Germany 
and, 30-31 
special aptitude tests, 37-38 
statistical methods, England and, 31 
norms, use in interpreting scores on, 
279-90 
validation, 108-11 
individual vs. group, 109-11 
meaning of intelligence, 108 
Terman criteria, 108-09 
Interest, objective of instruction, 145 
Interest measurement, history, 46-52 
beginnings, 46 
development, 46-48, 51-52 
interview, 51 
questionnaire, 49-51 
rating scales, 48 
International Institute of Teachers Col- 
lege, Columbia University, 194 
Interpretation, ease of, usability, 129- 
30 


Interpretation of Educational Measure-
ments, Kelley, 54

Interviews, character, personality and in- 
terest measurement, 51 

Introduction to the Theory of Mental and 
Social Measurements, Thorndike, 53 

Iowa Academic Testing Program, 304

Iowa Every-Pupil Tests in Basic Skills,
1 


Iowa Placement Examinations, 37, 168 

Iowa Silent Reading Tests, New Edition, 
16, ez 255, 205 i 

IQ, see Intelligence quotient

Item-analysis procedure, simplified, 436- 
53 


difficulty, measure, 440
discrimination, measure, 437-40 
discrimination table, 447-51 
illustrative analysis, 440-47 

items, preparing, 436-37 

mean, obtaining, 452
reliability coefficient, obtaining, 452
standard deviation, obtaining, 452
validation, 117-19


Items, test:
analysis, see Item-analysis procedure
preparing, 436-37
ranking, 454-55 


Journal of Applied Psychology, 53
Journal of Consulting Psychology, 53
Journal of Educational Psychology, 52
Journal of Educational Research, 53
Journal of Genetic Psychology, 52 
Judgment, objective of instruction, 145-46 


KR No. 20, 124 

Kuder and Richardson formulas, 124 
Kuder Preference Record, 51, 129, 421 
Kuhlmann-Finch Intelligence Tests, 182 


L'Année Psychologique, 31, 52
Learning: 


amount and quality, relation of meas- 
urement to, 309-23 
as of final examination, 312- 
frequency of tests, 309-12 
knowledge of results combined with 
other incentives, 319-23
knowledge of test scores, 315-19 
motivation in, relation of measurement 
to, 306-25 
type of, relation of measurement to,
323-25
Leiter International Performance Scale, 
Arthur Adaptation, 422 
Letters to parents, 410-13 
Colorado experiment, 410-11 
suggestions for, 410 
University of Chicago High School Sys- 
tem, 411-13 
Limitations:
alternative-response tests, 174-75 
completion tests, 170 
educational age, 291-92 
educational quotient, 292-93 
essay examinations, 193-95 
grade norms, 293-94 
intelligence quotient, 285-88 
matching tests, 186 
measurement, 12-13 
mental age, 280-81 
multiple-choice tests, 180-81 
simple-recall tests, 167 
Limits of classes, determining, frequency 
distribution, 64 
Local norms, value, 295-96, 297 


MA, see Mental age
Marks, school, unreliability, studies in, 
39-43 
Matching tests, 186-90 
advantages, 186
construction, rules and suggestions, 
189-90 
definition, 186 
illustrations, 187-89 
limitations, 186 




Mean, 73 
computing, from scattergram data, 94 
finding, 78-81 
median versus, 74-75 
obtaining, item-analysis, 452 
short way to compute, 80 
Measurement: 
errors in, controlling, 13 
errors of, measures, 101, 102-03 
importance, 3-4 
in applied sciences, 11-12 
in biological sciences, 7-9 
in education, 13-25 
achievement tests, history, 38-46 
character, personality and interest 
tests, history, 46-52 
function, 21-22 
historical development, 27-58 
intelligence tests, history, 30-38 
place of, 16-18 
publications, important, 52-55 
recent tendencies, 56-58 
types, 23-25 
views of, 17 
in guidance, 369-70 
in instruction, 303-425 
classification and promotion, 347-
65
diagnosis, 328-45 
evaluation in guidance, 367-71 
evaluation of schools, 373-98 
motivation and practice in testing, 
303-27 
public relations, 400-15
trends, present, 416-24 
in modern world, 3-25 
in physical sciences, 6-7 
in science, 4-13 
in social sciences, 9-11 
limitations, 12-13 
problem of, 3-134 
generalizations regarding, 131-34
relation to amount and quality of learn-
ing, 309-23
vido nd of final examination, 312- 
5 


frequency of tests, 309-12 
knowledge of results combined with 
other incentives, 319-23 
knowledge of test scores, 315-19 
relation to motivation, 304 
in learning, 306-25 
in teaching, 304-06 
relation to type of learning, 323-25 
teacher and, purpose, 304 
teaching emphasis and, 304-06 
Measurement in Higher Education, Wood,


Measurement in Secondary Education,
Symonds, 54 

Measurement of Adult Intelligence, Wechs-
ler, 55

Measurement of Intelligence, Terman, 34, 


Measurement of Intelligence, Thorndike,


Measuring instrument, satisfactory, char- 
acteristics, 106-34 
importance of problem, 106 
reliability, 121-27 
determining, methods, 122-24 
importance, 121-2 
interpretation, 124-25 
meaning, 121 
objectivity and, 125-27 
usability, 127-31 
administration, ease, 127-28 
application, ease, 129-30 
cost, 130 
interpretation, ease, 129-30 
meaning, 127 
mechanical make-up, 130-31 
scoring, ease, 128-29
validity, 107-21 
achievement tests, 111-21 
considerations, general, 107-08 
intelligence tests, 108-11 
meaning, 107 
Mechanical make-up of tests, usability, 
130-31 
Median, 73 
determining, from scattergram data, 94- 
95 
finding, 75-78 
mean versus, 74-75
process of locating, 76 
Mental age (MA): 
concept, 32 
advantages, 280 
intelligence quotient versus, 279-80 
limitations, 280-81 


Mental Measurements Yearbooks, 54, 120, 


121, 218 
Mental quotient, 31 


Mental Testing, Goodenough, 55 


Methodology of Educational Research, Good
et al., 127
Metropolitan and Standard Achievement
Tests, 291
Midscore, 75-76 
Mind, 52 
Mode, 73 
finding, 75 
Modern School Achievement Tests, Lan- 
guage Usage, 182 
Motivation: 
experiment in, 306-07 
limitations on, 307-08 
types, 308-25 
importance, 303 
meaning, 303-04 
objective of instruction, 145 
practice effect, 326-27 
problem, 303-04 
relation of measurement to, 304 
in learning, 306-25 
in teaching, 304-06
studies, educational implications, 325-
26
for educational practice, 325-26
for educational theory, 325




Multiple-choice tests, 179-86 
construction, rules and suggestions for, 
184-86 
definition, 179-80 
illustrations, 181-84 
limitations, 180-81 
possibilities, 180-81 


National Council of Teachers of English,
National Education Association, 374, 380 
National norms, 295 
Natural sciences, measurement in, 8-9 
Nelson High School English Test, 181-82,
Newspapers, local, agency of public infor-
mation, 402-03 
New York State Regents, 305 
Nineteen Forty Mental Measurements Year- 
book, 55 
Nonstandardized tests, standardized vs., 
274-15 
Normal curve, 260 
Normal Progress Chart, 257-58 
Norms: 
grade, 293-94 
intelligence and achievement, compar-
ing, 296-99
interpreting scores on achievement
tests, use, 290-96
interpreting scores on intelligence tests,
use, 279-90
interpreting scores on personality tests,
use, 299-300
local, value, 295-96, 297
national, 295
percentile, 294-95
scores, raw and derived, 276-79
standards and, 274-76
North Central Association of Colleges and
Secondary Schools, 382
Northwestern Intelligence Tests, 422
Novel tests and items, 422-23


Objective tests: 
alternative-response, 174-79 
advantages, 174-75
construction, rules and suggestions 
for, 178-79 
definition, 174 
illustrations, 175-78 
limitations, 174-75 
completion, 170-74 
advantages, 170
construction, rules and suggestions 
for, 172-74 
definition, 170 
illustrations, 170-71 
limitations, 170
construction, principles, 
frequency of use by teachers, 163-64 
matching, 186-90 
advantages, 186
construction, rules and suggestions, 
189-90 
definition, 186 




Objective tests (Cont.) :
matching (Cont.) : 
illustrations, 187-89 
limitations, 186 
multiple-choice, 179-86 
construction, rules and suggestions
for, 184-86 
definition, 179-80 
illustrations, 181-84 
limitations, 180-81 
possibilities, 180-81 
rearrangement, 190-91
simple-recall, 167-70
advantages, 167
construction, rules and suggestions
for, 169-70
definition, 167
illustrations, 167-69 
limitations, 167 


bn 163 т 4 
validity and reliability, comparative,


64-67 
Objectivity, reliability and, 125-27 
Occupations, 5 » k 
Official publications, public relations,
404-06

annual reports, 404 

special reports, 405-06 
Ogive, 260
Ohio State University Psychological Test, 

129 

Opinion, public, mobilizing, 414-15 
Organization, school, evaluating, 395-97 
Otis Quick-Scoring Mental Ability Test, 


287 

Otis Scales for Rating Standard Tests, 220 

Otis Self-Administered Test of Mental 
Ability, 287 

Otis Self-Administering Higher Examina- 
tion, 110-11 


Parents: 
letters to, 410-13
opinion of, sampling, 414-15
reports to, 245
Parent-teacher association, public infor-
mation, 413-14
Parent-Teacher Association, 245 
PC, see Personal Constant 
PEA Interpretation of Data Test, 422 
Pedagogical Seminary, 52 
Percentile curve, 260, 2 
Percentile norms, 294-95
Percentile rank, 
Herm Я ыа 
computation, A 
D, Shore of variability, 84-85 
Personal Constant (PC), 288
Personality tests, norms, use in interpret-
ing scores on, 299-300
Personality measurement, history, 46-52
beginnings, 46
development, 46-48, 51-52 
interview, 51 
questionnaire, 49-51 
rating scales, 48 




Personnel and Guidance Journal, 58
Personnel Selection, Tests and Measurement
Techniques, Thorndike, 55 

Philosophy of the school, evaluating, 383-84

Physical sciences, measurement in, 6-7
Physics, measurement in, 7

Pictographs, 263 

Pie graphs, 263 

Pintner General Ability Tests, 287 

Pintner-Paterson Performance Scale, 35,


Planning the test, 140-47
с conditions, considering, 
14 


emphasis in course, reflecting proportion 
of, 146-47 
evaluating outcomes of instruction, pro- 
vision for, 141-46 
purpose to be served, considering, 147 
Plant, school, evaluating, 395-97
“Platform for the Use of Standard Tests,” 
211-12 
Polygons, use, graphical representation, 
264-65


PrA, see Promotion age 

Predictive validity, 417 

Preliminary draft, preparing the test, 147-
48, 149


Preparing the test, 147-55 
iru in ascending order of difficulty, 
151-52 


difficulty of items, 148-49 

directions to pupil, 153-54 

на type of items placed together, 
5l 


pattern of responses, avoiding regular
sequence in, 152
phrasing of items, 149-50 
preliminary draft, 147-48 
preliminary draft items, 149 
revision, critical, 149 
types of items, 148
whole content functions in determining
answer, 150-51 
Written record of responses, provision 
for, 152-53 
Preventive diagnosis, 345 
Primary Mental Ability tests (PMA), 219, 
418 


Principles of Science, Jevons, 30 
Product-moment coefficient of correlation, 
r, 86-90
computing from scattergram, 93-94
interpreting, 98-100 
magnitude or size, 98-99 
obtaining, 90 
relationship represented by, 89 
reliability, 101 
sign, 98 
validity, 101, 109 
Professional journals, 52-53 
Profile chart, individual, 239 
Profiles, 254 
Series of subjects, 254-58 
single subject, 254 
Warnings concerning, 243-45 


Progressive Education Association, 117,
140, 374, 422 

Promotion: 
acceleration and retardation, 363-65 
continuous, 365 

Promotion age (PrA), 298 

Promotion quotient (PrQ), 298 

PrQ, see Promotion quotient 

Psychograph, 254 

Psychology:
abnormal, see Abnormal psychology
applied, see Applied psychology
experimental, see Experimental psychol-
ogy
Gestalt school, 9, 16-17 
Psychology Colloquium, University of 
Wisconsin, 423
Psychology of Musical Talent, Seashore, 54 
Psychometrika, 53 
Publications: 
important, 52-55 
books, 53-55 
professional journals, 52-53
public relations, agencies of public in-
formation:
newspapers, local, 402-03 
official, 404-06 
Student publications, 403-04 
Publicity, 401 
Public opinion, mobilizing, 414-15
Public relations, 400-15
agencies of public information, ordinary, 
402-04
newspapers, local, 402-03 
Student publications, 403-04 
letters to parents, 410-13 
Colorado experiment, 410-11 
suggestions for, 410
University of Chicago High School 
System, 411-13 
official publications, 404-06 
annual reports, 404
special reports, 405-06 
parent-teacher association, 413-14 
principal sources, 402 
problem, 400-02 
programs, meaning, 401-02 
public opinion, mobilizing, 414-15 
report cards, 406-12
Hill's study, 407-08
trends in, 406-07 
school exhibits, 413
school visitation, 413
Public School Attainment Tests for High 
School Entrance, 171 
Publishers of standardized tests, 464-65 
Pupils, report to, 237 


Qualitative evaluation technique, 420-22 
Quantitative data: 
concepts versus computations, 69-75 
central tendency, 69, 73-74, 75 
eleven categories, 72 
four categories, 71 
grading dilemma, 69-70
mean vs. median, 74-75


SUBJECT INDEX 


Quantitative data (Cont.): 
concepts versus computations (Cont.) : 
one category, 70 
three categories, 71 
two categories, 70-71 
variability, 69, 71-73 
elementary notions concerning, 69-75 
mean, 73 
finding, 78-81 
median vs., 74-75
median, 73 
finding, 75-78 
mean vs., 74-75 
mode, 73 
finding, 75 
percentiles, computation, 78 
Quartile deviation, 72 
variability measure, 81-83 
Quartiles, 77
Questionnaires, character, personality and 
interest measurement, 49-51 
Quotients: 
accomplishment, see Accomplishment 
quotient 
educational, see Educational quotient
intelligence, see Intelligence quotient
mental, see Mental quotient
promotion, see Promotion quotient 
subject, see Subject quotient 


Range, 72
determining, frequency distribution, 64 
variability measure, 81 

Rank correlation, 95-98 

Ranking test items, 454-55 

Rank order, 61-62

Rating scales, character, personality, and 

interest, measurement, 

Raw scores, derived scores vs., 278-79

Rearrangement tests, 190-91 

Records, 236-45 
for administrators, 240-43 
graphical representation, 254-58 
to teachers, 238-40 

Relationship, measures, 85-101 
coefficient of correlation, r,

computing from scattergram, 93-94 
interpreting, 98-100 
obtaining, 90 
reliability coefficient, 101 
validity coefficient, 101
co-relationship or concomitant varia- 
tion, concept, 85-90 
expectancy tables, 100-01 
rank correlation, 95-98 
scattergram, constructing, 90-93 
means from, computing, 94 
medians, determining, 94-95
r from, computing, 93-94
standard deviations from, computing, 
94 

Reliability: 

coefficient, 101
obtaining, item-analysis, 452 


Reliability (Cont.): 
determining, methods, 122-24 
with one test form, 123-24 
with two test forms, 122 
essay examination, 193-95, 196 
importance, 121-22 
interpretation of test, 124-25 
meaning, 121 
objective tests, 164-67 
objectivity and, 125-27 
present trend, 416 
quality of satisfactory measuring instru- 
ment, 121-27 
Remedial procedures, diagnosis, 341-45 
Report cards, 406-12 
Colorado experiment, 410-11
Hill's study, 407-08 
trends in, 406-07 
University of Chicago High School Sys- 
tem, 411-13 
Reporting to parents, see Letters to parents
Reports, 236-45 

annual, public relations, 404 

for administrators, 240-43

special, public relations, 405-06

to parents or public, 245

to pupils, 237 

to teachers, 238-40 
Reputation, in education, 19-20 
Research in education, 19, 20 
Results, test, see Test results 
Retardation, 363-65 
Retesting, 236 
Review of Educational Research, 22, 47 fn. 
Revised Stanford-Binet Intelligence Scale, 

55, 215, 282, 418 

Revision, critical, preparing the test, 149 
Rhetoric in education, 19 


Sampling, errors of, measures, 101, 103 
“Scale-Book,” Fisher's, 38 
Scaled test, defined, 24 
Scales: 
rating, see Rating scales 
test distinguished from, 24 
Scatter, see Variability 
Scatter diagram, see Scattergram 
Scattergram, 67-69
constructing, 90-93 
means from, computing, 94 
medians, determining, 94-95 
negative correlation, illustrating, 87 
r from, computing, 93-94
standard deviations from, computing, 


94 
School and Society, 53 
Schools, evaluation, 373-98 
Cooperative Study of Secondary School 
Standards, 378-80 
difficulty, 376-77 
educational program, 384-94 
importance, 375-76 
index of variation, 397-98 
measurement and, 373-75 
organization and plant, 395-97 




Schools, evaluation (Cont.): 
ан of, 383-84 
principles of, general, 380-83 
for elementary schools, 380-81 
for higher institutions, 382-83 
for secondary schools, 381-82 
problem, 373-80 
teaching efficiency, 377-78 
tests in, use, 379 
Science, measurement in, 4-13 
Science Research Associates (SRA): 
Non-Verbal Test, 110-11 
Primary Mental Ability tests, 110-11,
8 
self-scoring test, 129 
Scientific method, 4-6 
Scores: 
analyzing and interpreting, 234-35 
definition, 276-78 
derived, see Derived scores 
intelligence and achievement, combin- 
ing, 208-09 
interpreting on achievement tests, use of 
norms in, 290-96 
interpreting on intelligence tests, use of 
norms in, 279-90 
interpreting on personality tests, use of 
norms in, 299-300 
raw vs. derived, 278-79 
sigma, 289-90 
standard, 85, 289-90, 295 
T-scores, 295 
Z-scores, 85, 289-90 
Scoreze, 129 
Scoring the tests, 230-34 
ease of, usability, 128-29 
essay examination, 202-04 
by sorting, 204-05 
procedure, 156-58 
rules, preparing, 158
techniques used, 231-34 
who should score, 230-31 
Scott Man-to-Man Scale, 48 
Seashore Test of Musical Talent, 37 
Secondary schools, principles of evaluation 
for, 381-82 
Selected References on Test Construction, 
Mental Test Theory, and Statistics, 
1929-49, Goheen-Kavruck, 55 
Semi-quantitative evaluation technique, 
420-22 
Seven Seals of Science, Mayer, 9-10 
Seventeenth Yearbook of the National Society
for the Study of Education, 53 
Sigma scores, 289-90 
Simple-recall tests, 167-70 
advantages, 167 
construction, rules and suggestions for, 
169-70 
definition, 167 
illustrations, 167-69 
limitations, 167 
Skewed curves, 262-63 
Smooth curve, 260-62 
Social sciences, measurement in, 9-11




Sones-Harry High School Achievement 
Test, 168, 189 
Spearman-Brown formula, 124 
Special aptitude tests, 37-38 
Special reports and publications, public 
relations, 405-06 
Specific determiners, 149
Square roots, computation, 456-58 
Standard deviation, 72 
computing from scattergram data, 94
obtaining, item-analysis, 452
practical uses, 85
simplified way to compute, 84 
variability measure, 83-85 
Standardized tests: 
nonstandardized vs., 274-75
publishers, 464-65 
Standards, norms and, 274-76 
Standard scores, 85, 289-90, 295 
Stanford Achievement Test, 44, 170-71 
Stanford-Binet scale, 33-34, 127 
Statistical analysis, test results, 60-104 
classification and tabulation, 61-69 
frequency table, 62-69 
rank order, 61-62 
considerations, general, 60-61 
error, measures, 101-03 
measurement, 101, 102-03 
sampling, 101, 103 
technique, 101, 102 
quantitative data, 69-75 
concepts versus computations, 69-75 
relationship, measures, 85-101 
coefficient of correlation, 86-90, 93-94,
98-100
co-relationship or concomitant varia- 
tion, concept, 85-90 
expectancy tables, 100-01
rank correlation, 95-98 
reliability coefficient, 101 
scattergram data, 90-95 
validity coefficient, 101
variability or scatter, measures, 81-85 
percentile D, 84-85 
quartile deviation, 81-83 
range, 81 
standard deviation, 83-85 
Statistical methods, England and, 31 
Statistical validity, 101 
curricular vs., 111-13
Statistics: 
fifty questions, 429-35 
answers to, 459-63 
importance, 60 
in a capsule, 60-61 
Status validity, 417 
Stenquist Test of General Mechanical 
Ability, 37 
Stone Reasoning Test in Arithmetic, 39,
16 


Strong Vocational Interest Blank, 51 

Student publications, agencies of public
information, 403-04 

Subject age, 291 

Subject quotient, 291 

Symmetrical curves, 262-63




T-scores, 295
Tabulation, test results, 62 
frequency table or distribution, 62-69 
form, 66-67 
making, 64-66 
scattergram, 67-69
two-way, 67-69 
Teacher: 
measurement and, purpose, 304 
records and reports to, 238-40 
Teachers College Record, 52 
Teaching, see Instruction 
Teaching emphasis, measurement and, 
304-06 
Technique, errors of, measures, 101, 102
Terman criteria, intelligence, 108-09 
Terman Group Test of Mental Ability,
79 
Terman-McNemar Test of Mental Ability,
110-11, 287
Test construction, 139-205
essay examination, 198-99 
evaluating the test, 159-62 
objective tests, principles, 163-91 
alternative-response, 178-79 
completion, 172-74 
frequency of use by teachers, 163-64 
matching, 189-90 
multiple-choice, 184-86 
simple-recall, 169-70
types, 163
validity and reliability, comparative, 
164-67 
planning the test, 140-47
conditions of administration, consideration, 147
emphasis in course, reflecting proportion of, 146-47
evaluating outcomes of instruction,
provision for, 141-46
purpose to be served, consideration,
147
preparing the test, 147-55
arranged in ascending order of diffi- 
culty, 151-52 
difficulty of items, 148-49 
directions to pupil, 153-54 
type of items placed together, 151
pattern of responses, avoiding regular
sequence in, 152 
phrasing of items, 149-50 
preliminary draft, 147-48 
preliminary draft items, 149 
revision, critical, 149
types of items, 148
whole content functions in determining
answer, 150-51
written record of responses, provision 
for, 152-53 
principles, general, 139-62 
problem, importance, 139-40 
recent tendencies, 56-57 
trying out the test, 155-58 
answer keys, 158
conditions for, insuring normal, 155




Test construction (Cont.): 
trying out the test (Cont.): 
scoring procedure, 156-58 
scoring rules, 158 
time allowance, 155-56 
Testing program, 209-300 
administering the tests, 225-30 
procedure for, 228-30 
time for, 225-27 
who should administer, 227 
considerations, general, 209-12 
co-operative, 213 
definite, 214 
graphical representation, 247-73 
о suggestions for, 271- 


distributions, two or more, represent- 
ing, 264-71 

frequency distribution, representing, 
258-64 

record of an individual, representing, 
254-58

value, 247-54 

norms, uses and limitations, 274-300

intelligence and achievement, com- 
paring, 296-99 

interpreting scores on achievement 
tests, use, 290-96

interpreting scores on intelligence 
tests, use, 279-90 

interpreting scores on personality 
tests, use, 299-300 

raw scores and derived scores, 276-


7 
standards and norms, 274-76
plan for, elementary school, 210
practical, 213-14
profiles, warnings concerning, 243-45 
purpose of, determining, 212-14 
records, 236-45 
for administrators, 240-43 
to teachers, 238-40
reports, 236-45 
for administrators, 240-43
to parents or public, 245 
to pupils, 237 
to teachers, 238-40 
results, applying, 235-36 
retesting, 236
scores, analyzing and interpreting, 234-35
scoring the tests, 230-34 
techniques used, 231-34 
who should score, 230-31 
selecting tests, 214-24 
procedure for, 218-24 
type of tests, 215 
who shall select, 214-15 
steps in, 209-45 
Testing School Children, Stephenson, 55 
Test results, statistical analysis, 60-104
applying, 235-36
classification and tabulation, 61-69
frequency table, 62-69 
rank order, 61-62 
considerations, general, 60-61 


Test results (Cont.):
error, measures, 101-03 
measurement, 101, 102-03 
sampling, 101, 103 
technique, 101, 102 
quantitative data, 69-75 
concepts vs. computations, 69-75 
mean, finding, 78-81 
median, finding, 75-78 
mode, finding, 75 
percentiles, computing, 78 
relationship, measures, 85-101 
coefficient of correlation, 86-90, 93-94,
98-100
co-relationship or concomitant varia- 
tion, concept, 85-90 
expectancy tables, 100-01 
rank correlation, 95-98 
reliability coefficient, 101 
scattergram data, 90-95
validity coefficient, 101 
variability or scatter, measures, 81-85 
percentile, D, 84-85 
quartile deviation, 81-83 
range, 81 
standard deviation, 83-85
Tests: 
achievement, see Achievement tests 
administering, 225-30 
alternative-response, see Alternative-response
tests
aptitude, see Aptitude tests 
completion, see Completion tests 
construction, see Test construction 
essay examinations, see Essay examina- 
tions 
evaluating, 159-62 
intelligence, see Intelligence tests 
matching, see Matching tests 
multiple-choice, see Multiple-choice 
tests 
nonstandardized, see Nonstandardized
tests
novel, 422-23 
objective, see Objective tests 
planning, 140-47
preparing, 147-55 
rearrangement, see Rearrangement tests
reliability, see Reliability 
results, see Test results 
scale distinguished from, 24 
scores, see Scores
scoring, see Scoring the tests 
selecting, 214-24
simple-recall, see Simple-recall tests 
standardized, see Standardized tests 
time allowance for, 155-56 
trying out, see Trying out tests 
usability, see Usability 
validity, see Validity 
Tests and Measurements in High School Instruction,
Ruch and Stoddard, 54
Tests in English Fundamentals: Gram- 
mar, 176-77 
Tests of General Educational Develop- 
ment (GED), 422 


Tests on Everyday Problems in Science: 
Unit III, 177-78
Thematic Apperception Test (TAT), 421 
Theory and Practice of Psychological Test- 
ing, Freeman, 55 
Theory of Mental Tests, Gulliksen, 55 
Third Mental Measurements Yearbook
(1949), 55 
Thorndike Handwriting Scale, 39 
Thorndike-McCall Reading Test, 279, 295 
Three-Year Study of Commission on 
Teacher Education, 374 
Time allowance for test, 155-56 
Trait variability, 352-53 
Traxler Silent Reading Test, Word mean- 
ing, 183 
Trying out the test, 155-58 
answer keys, 158 
conditions for, insuring normal, 155 
scoring procedure, 156-58 
scoring rules, 158 
time allowance, 155-56 
Two-way frequency table, 67-69 
Typewriter graphs, 263 


Ungrouped series, 61 
Unit Scales of Attainment in Foods and 
Household Management, 183 
University of Chicago High School System 
of reporting, 411-13 
Usability: 
administration, ease, 127-28 
application, ease, 129-30 
cost, 130 
essay examination, 195, 196 
interpretation, ease, 129-30 
meaning, 127 
mechanical make-up, 130-31 
quality of satisfactory measuring instru- 
ment, 127-31 
recent tendencies, 57-58 
scoring, ease, 128-29
Utilizing Human Talent, Davis, 55 


Validity: 

achievement tests, 111-21 
criticisms, 113-14 
curricular vs. statistical, 111-13 
direct vs. indirect methods, 115-17 
item analysis, 117-19 
standard tests, judging, 119-21
Tyler's suggestions, 114-15

coefficient, 101 

considerations, general, 107-08 

curricular, 101 
statistical vs., 111-13

essay examination, 193, 196-97 

intelligence tests, 108-11 
individual vs. group, 109-11 
meaning of intelligence, 108 
Terman criteria, 108-09 

meaning, 107 

objective tests, 164-67 

quality of satisfactory measuring instru- 

ment, 107-21 




Validity (Cont.):
statistical, 101 
curricular versus, 111-13 
types, 416-17 
Variability:
concept of test data, 69, 71-72 
human, see Human variability 
meaning, 81 
measures, 81-85 
percentile, D, 84-85 
quartile deviation, 81-83 
range, 81 
standard deviation, 83-85 
vocabulary of, 72-73 
Variations, concomitant, see Concomitant 
variation 
Visitation, school, public information, 413 




Watson-Glaser Critical Thinking Appraisal,
422-23
Wechsler-Bellevue Intelligence Scales, 55, 


215 
Wechsler Intelligence Scale for Children 
(WISC), 55, 418, 4 
Wesley Test in Political Terms, 182 
Wholistic Гия 420 
Woodworth Personal Data Sheet, 49 
World success, criterion of intelligence, 109 
World War I, 35-37, 
World War II, 37, 55, 363 


X-O Test, 49 


Z-scores, 85, 289-90 


